Preface


Just a few years ago, there were no legions of deep learning scientists developing intelli- 
gent products and services at major companies and startups. When we entered the field, 
machine learning did not command headlines in daily newspapers. Our parents had no idea 
what machine learning was, let alone why we might prefer it to a career in medicine or law. 
Machine learning was a blue skies academic discipline whose industrial significance was 
limited to a narrow set of real-world applications, including speech recognition and com- 
puter vision. Moreover, many of these applications required so much domain knowledge 
that they were often regarded as entirely separate areas for which machine learning was 
one small component. At that time, neural networks—the predecessors of the deep learn- 
ing methods that we focus on in this book—were generally regarded as outmoded. 


Yet in just a few years, deep learning has taken the world by surprise, driving rapid progress
in such diverse fields as computer vision, natural language processing, automatic speech 
recognition, reinforcement learning, and biomedical informatics. Moreover, the success 
of deep learning in so many tasks of practical interest has even catalyzed developments in 
theoretical machine learning and statistics. With these advances in hand, we can now build 
cars that drive themselves with more autonomy than ever before (though less autonomy 
than some companies might have you believe), dialogue systems that debug code by asking 
clarifying questions, and software agents beating the best human players in the world at 
board games such as Go, a feat once thought to be decades away. Already, these tools exert 
ever-wider influence on industry and society, changing the way movies are made and diseases
are diagnosed, and playing a growing role in basic sciences, from astrophysics to climate
modeling, weather prediction, and biomedicine.


About This Book 


This book represents our attempt to make deep learning approachable, teaching you the 
concepts, the context, and the code. 


One Medium Combining Code, Math, and HTML 


For any computing technology to reach its full impact, it must be well understood, well 
documented, and supported by mature, well-maintained tools. The key ideas should be 
clearly distilled, minimizing the onboarding time needed to bring new practitioners up to
date. Mature libraries should automate common tasks, and exemplar code should make
it easy for practitioners to modify, apply, and extend common applications to suit their 
needs. 


As an example, take dynamic web applications. Despite a large number of companies, 
such as Amazon, developing successful database-driven web applications in the 1990s, the 
potential of this technology to aid creative entrepreneurs was realized to a far greater degree 
only in the past ten years, owing in part to the development of powerful, well-documented 
frameworks. 


Testing the potential of deep learning presents unique challenges because any single appli- 
cation brings together various disciplines. Applying deep learning requires simultaneously 
understanding (i) the motivations for casting a problem in a particular way; (ii) the math- 
ematical form of a given model; (iii) the optimization algorithms for fitting the models to 
data; (iv) the statistical principles that tell us when we should expect our models to general- 
ize to unseen data and practical methods for certifying that they have, in fact, generalized; 
and (v) the engineering techniques required to train models efficiently, navigating the pit- 
falls of numerical computing and getting the most out of available hardware. Teaching the 
critical thinking skills required to formulate problems, the mathematics to solve them, and 
the software tools to implement those solutions all in one place presents formidable chal- 
lenges. Our goal in this book is to present a unified resource to bring would-be practitioners 
up to speed. 


When we started this book project, there were no resources that simultaneously (i) remained 
up to date; (ii) covered the breadth of modern machine learning practices with sufficient 
technical depth; and (iii) interleaved exposition of the quality one expects of a textbook 
with the clean runnable code that one expects of a hands-on tutorial. We found plenty of 
code examples illustrating how to use a given deep learning framework (e.g., how to do 
basic numerical computing with matrices in TensorFlow) or for implementing particular 
techniques (e.g., code snippets for LeNet, AlexNet, ResNet, etc.) scattered across various 
blog posts and GitHub repositories. However, these examples typically focused on how to 
implement a given approach, but left out the discussion of why certain algorithmic decisions
are made. While some interactive resources have popped up sporadically to address a
particular topic, e.g., the engaging blog posts published on the website Distill, or personal
blogs, they only covered selected topics in deep learning, and often lacked associated code.
On the other hand, while several deep learning textbooks have emerged—e.g., Goodfellow 
et al. (2016), which offers a comprehensive survey on the basics of deep learning—these 
resources do not marry the descriptions to realizations of the concepts in code, sometimes 
leaving readers clueless as to how to implement them. Moreover, too many resources are 
hidden behind the paywalls of commercial course providers. 


We set out to create a resource that could (i) be freely available for everyone; (ii) offer suffi- 
cient technical depth to provide a starting point on the path to actually becoming an applied 
machine learning scientist; (iii) include runnable code, showing readers how to solve prob- 
lems in practice; (iv) allow for rapid updates, both by us and also by the community at large; 
and (v) be complemented by a forum for interactive discussion of technical details and to
answer questions.


These goals were often in conflict. Equations, theorems, and citations are best managed and
laid out in LaTeX. Code is best described in Python. And webpages are native in HTML 
and JavaScript. Furthermore, we want the content to be accessible as executable code,
as a physical book, as a downloadable PDF, and on the Internet as a website. No workflows
seemed suited to these demands, so we decided to assemble our own (Section B.6). We 
settled on GitHub to share the source and to facilitate community contributions; Jupyter 
notebooks for mixing code, equations and text; Sphinx as a rendering engine; and Discourse 
as a discussion platform. While our system is not perfect, these choices strike a compromise 
among the competing concerns. We believe that Dive into Deep Learning might be the first 
book published using such an integrated workflow. 


Learning by Doing 


Many textbooks present concepts in succession, covering each in exhaustive detail. For 
example, the excellent textbook of Bishop (2006) teaches each topic so thoroughly that
getting to the chapter on linear regression requires a nontrivial amount of work. While 
experts love this book precisely for its thoroughness, for true beginners, this property limits 
its usefulness as an introductory text. 


In this book, we teach most concepts just in time. In other words, you will learn concepts 
at the very moment that they are needed to accomplish some practical end. While we 
take some time at the outset to teach fundamental preliminaries, like linear algebra and 
probability, we want you to taste the satisfaction of training your first model before worrying 
about more esoteric concepts. 


Aside from a few preliminary notebooks that provide a crash course in the basic mathe- 
matical background, each subsequent chapter both introduces a reasonable number of new 
concepts and provides several self-contained working examples, using real datasets. This 
presented an organizational challenge. Some models might logically be grouped together 
in a single notebook. And some ideas might be best taught by executing several models 
in succession. By contrast, there is a big advantage to adhering to a policy of one working 
example, one notebook: This makes it as easy as possible for you to start your own research 
projects by leveraging our code. Just copy a notebook and start modifying it. 


Throughout, we interleave the runnable code with background material as needed. In gen- 
eral, we err on the side of making tools available before explaining them fully (often filling 
in the background later). For instance, we might use stochastic gradient descent before 
explaining why it is useful or offering some intuition for why it works. This helps to give 
practitioners the necessary ammunition to solve problems quickly, at the expense of requir- 
ing the reader to trust us with some curatorial decisions. 


This book teaches deep learning concepts from scratch. Sometimes, we delve into fine 
details about models that would typically be hidden from users by modern deep learning 
frameworks. This comes up especially in the basic tutorials, where we want you to un- 
derstand everything that happens in a given layer or optimizer. In these cases, we often 
present two versions of the example: one where we implement everything from scratch, 
relying only on NumPy-like functionality and automatic differentiation, and a more practical
example, where we write succinct code using the high-level APIs of deep learning
frameworks. After explaining how some component works, we rely on the high-level API 
in subsequent tutorials. 


Content and Structure 


The book can be divided into roughly three parts, dealing with preliminaries, deep learning 
techniques, and advanced topics focused on real systems and applications (Fig. 1). 


[Figure 1: Book structure. Chapters shown: 1. Introduction; 2. Preliminaries; 3-4. Linear Neural Networks; 5. Multilayer Perceptrons; 6. Builders' Guide; 7. Convolutional Neural Networks; 8. Modern Convolutional Neural Networks; 9. Recurrent Neural Networks; 10. Modern Recurrent Neural Networks; 11. Attention Mechanisms and Transformers.]


• Part 1: Basics and Preliminaries. Chapter 1 is an introduction to deep learning. Then,
in Chapter 2, we quickly bring you up to speed on the prerequisites required for hands-on
deep learning, such as how to store and manipulate data, and how to apply various
numerical operations based on elementary concepts from linear algebra, calculus,
and probability. Chapter 3 and Chapter 5 cover the most fundamental concepts and
techniques in deep learning, including regression and classification; linear models;
multilayer perceptrons; and overfitting and regularization.


• Part 2: Modern Deep Learning Techniques. Chapter 6 describes the key computational
components of deep learning systems and lays the groundwork for our subsequent
implementations of more complex models. Next, Chapter 7 and Chapter 8
present convolutional neural networks (CNNs), powerful tools that form the backbone
of most modern computer vision systems. Similarly, Chapter 9 and Chapter 10
introduce recurrent neural networks (RNNs), models that exploit sequential (e.g., temporal)
structure in data and are commonly used for natural language processing and
time series prediction. In Chapter 11, we describe a relatively new class of models,
based on so-called attention mechanisms, that has displaced RNNs as the dominant
architecture for most natural language processing tasks. These sections will bring
you up to speed on the most powerful and general tools that are widely used by deep
learning practitioners.


• Part 3: Scalability, Efficiency, and Applications (available online). In Chapter 12, we
discuss several common optimization algorithms used to train deep learning models.
Next, in Chapter 13, we examine several key factors that influence the computational
performance of deep learning code. Then, in Chapter 14, we illustrate major applications
of deep learning in computer vision. Finally, in Chapter 15 and Chapter 16, we
demonstrate how to pretrain language representation models and apply them to natural
language processing tasks.


Code 


Most sections of this book feature executable code. We believe that some intuitions are best 
developed via trial and error, tweaking the code in small ways and observing the results. 
Ideally, an elegant mathematical theory might tell us precisely how to tweak our code to 
achieve a desired result. However, deep learning practitioners today must often tread where 
no solid theory provides guidance. Despite our best attempts, formal explanations for the 
efficacy of various techniques are still lacking, for a variety of reasons: the mathematics to 
characterize these models can be so difficult; the explanation likely depends on properties 
of the data that currently lack clear definitions; and serious inquiry on these topics has 
only recently kicked into high gear. We are hopeful that as the theory of deep learning 
progresses, each future edition of this book will provide insights that eclipse those presently 
available. 


To avoid unnecessary repetition, we capture some of our most frequently imported and used
functions and classes in the d2l package. Throughout, we mark blocks of code (such as
functions, classes, or collections of import statements) with #@save to indicate that they will
be accessed later via the d2l package. We offer a detailed overview of these classes and
functions in Section B.8. The d2l package is lightweight and only requires the following
dependencies:


#@save
import collections
import hashlib
import inspect
import math
import os
import random
import re
import shutil
import sys
import tarfile
import time
import zipfile
from collections import defaultdict

import pandas as pd
import requests
from IPython import display
from matplotlib import pyplot as plt
from matplotlib_inline import backend_inline

d2l = sys.modules[__name__]


Most of the code in this book is based on PyTorch, a popular open-source framework that
has been enthusiastically embraced by the deep learning research community. All of the 
code in this book has passed tests under the latest stable version of PyTorch. However, due 
to the rapid development of deep learning, some code in the print edition may not work 
properly in future versions of PyTorch. We plan to keep the online version up to date. 
In case you encounter any problems, please consult the Installation section to update
your code and runtime environment. The dependencies of our PyTorch implementation are
listed below.


#@save
import numpy as np
import torch
import torchvision
from PIL import Image
from scipy.spatial import distance_matrix
from torch import nn
from torch.nn import functional as F
from torchvision import transforms


Target Audience 


This book is for students (undergraduate or graduate), engineers, and researchers, who seek 
a solid grasp of the practical techniques of deep learning. Because we explain every con- 
cept from scratch, no previous background in deep learning or machine learning is required. 
Fully explaining the methods of deep learning requires some mathematics and program- 
ming, but we will only assume that you enter with some basics, including modest amounts 
of linear algebra, calculus, probability, and Python programming. Just in case you have 
forgotten anything, the online Appendix provides a refresher on most of the mathematics
you will find in this book. Usually, we will prioritize intuition and ideas over mathematical
rigor. If you would like to extend these foundations beyond the prerequisites to understand
our book, we happily recommend some other terrific resources: Linear Analysis by Bollobás
(1999) covers linear algebra and functional analysis in great depth. All of Statistics
(Wasserman, 2013) provides a marvelous introduction to statistics. Joe Blitzstein's books
and courses on probability and inference are pedagogical gems. And if you have not used
Python before, you may want to peruse this Python tutorial.


Notebooks, Website, GitHub, and Forum 


All of our notebooks are available for download on the D2L.ai website and on GitHub.
Associated with this book, we have launched a discussion forum, located at discuss.d2l.ai.
Whenever you have questions on any section of the book, you can find a link to the
associated discussion page at the end of each notebook.


Summary


Deep learning has revolutionized pattern recognition, introducing techniques that now
power a wide range of technologies in such diverse fields as computer vision, natural
language processing, and automatic speech recognition. To successfully apply deep learn- 
ing, you must understand how to cast a problem, the basic mathematics of modeling, the 
algorithms for fitting your models to data, and the engineering techniques to implement it 
all. This book presents a comprehensive resource, including prose, figures, mathematics, 
and code, all in one place. 


Exercises 


1. Register an account on the discussion forum of this book, discuss.d2l.ai.


2. Install Python on your computer.


3. Follow the links at the bottom of the section to the forum, where you will be able to
seek out help and discuss the book and find answers to your questions by engaging the
authors and broader community.


Installation


In order to get up and running, we will need an environment for running Python, the Jupyter 
Notebook, the relevant libraries, and the code needed to run the book itself. 


Installing Miniconda 


Your simplest option is to install Miniconda. Note that the Python 3.x version is required.
You can skip the following steps if your machine already has conda installed.


Visit the Miniconda website and determine the appropriate version for your system based 
on your Python 3.x version and machine architecture. Suppose that your Python version is 
3.9 (our tested version). If you are using macOS, you would download the bash script whose 
name contains the string “MacOSX”, navigate to the download location, and execute the
installation as follows (taking Intel Macs as an example): 


# The file name is subject to changes 
sh Miniconda3-py39_4.12.0-MacOSX-x86_64.sh -b 


A Linux user would download the file whose name contains the string “Linux” and execute
the following at the download location: 


# The file name is subject to changes 
sh Miniconda3-py39_4.12.0-Linux-x86_64.sh -b


A Windows user would download and install Miniconda by following its online instructions.
On Windows, you may search for cmd to open the Command Prompt (command-line
interpreter) for running commands.


Next, initialize the shell so we can run conda directly. 


~/miniconda3/bin/conda init 


Then close and reopen your current shell. You should be able to create a new environment 
as follows:


conda create --name d2l python=3.9 -y


Now we can activate the d2l environment:


conda activate d2l


Installing the Deep Learning Framework and the
d2l Package


Before installing any deep learning framework, please first check whether or not you have 
proper GPUs on your machine (the GPUs that power the display on a standard laptop are 
not relevant for our purposes). For example, if your computer has NVIDIA GPUs and has 
installed CUDA, then you are all set. If your machine does not house any GPU, there
is no need to worry just yet. Your CPU provides more than enough horsepower to get you 
through the first few chapters. Just remember that you will want to access GPUs before 
running larger models. 


You can install PyTorch (the specified versions are tested at the time of writing) with either 
CPU or GPU support as follows: 


pip install torch==2.0.0 torchvision==0.15.1


Our next step is to install the d2l package that we developed in order to encapsulate
frequently used functions and classes found throughout this book:


pip install d2l==1.0.3
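Once the packages are installed, a quick sanity check can confirm that they are importable and whether PyTorch can see a CUDA-capable GPU. The following is a minimal sketch, not part of the d2l package: the helper name `describe_environment` is our own invention for illustration, while `importlib.util.find_spec` and `torch.cuda.is_available()` are standard Python and PyTorch calls.

```python
import importlib
import importlib.util


def describe_environment(packages=("torch", "torchvision", "d2l")):
    """Report which packages are importable and whether a CUDA GPU is visible.

    Hypothetical helper for illustration only; it is not part of d2l.
    """
    report = {}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    report["cuda"] = False
    if report.get("torch"):
        torch = importlib.import_module("torch")
        # True only if PyTorch was built with CUDA support and a GPU is usable.
        report["cuda"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `cuda` is reported as `False`, the CPU is still sufficient for the first few chapters, as noted above.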


Downloading and Running the Code 


Next, you will want to download the notebooks so that you can run each of the book's
code blocks. Simply click on the “Notebooks” tab at the top of any HTML page on the
D2L.ai website to download the code and then unzip it. Alternatively, you can fetch the
notebooks from the command line as follows:


mkdir d2l-en && cd d2l-en 

curl https://d2l.ai/d2l-en-1.0.3.zip -o d2l-en.zip
unzip d2l-en.zip && rm d2l-en.zip 

cd pytorch


If you do not already have unzip installed, first run sudo apt-get install unzip. Now
we can start the Jupyter Notebook server by running: 


jupyter notebook 


At this point, you can open http://localhost:8888 (it may have already opened automatically) 
in your web browser. Then we can run the code for each section of the book. Whenever 
you open a new command line window, you will need to execute conda activate d2l
to activate the runtime environment before running the D2L notebooks, or updating your
packages (either the deep learning framework or the d2l package). To exit the environment,
run conda deactivate.


Notation


Throughout this book, we adhere to the following notational conventions. Note that some 
of these symbols are placeholders, while others refer to specific objects. As a general rule 
of thumb, the indefinite article “a” often indicates that the symbol is a placeholder and that 
similarly formatted symbols can denote other objects of the same type. For example, “$x$: a
scalar” means that lowercased letters generally represent scalar values, but “$\mathbb{Z}$: the set of
integers” refers specifically to the symbol $\mathbb{Z}$.


Numerical Objects 


• $x$: a scalar
• $\mathbf{x}$: a vector
• $\mathbf{X}$: a matrix
• $\mathsf{X}$: a general tensor
• $\mathbf{I}$: the identity matrix (of some given dimension), i.e., a square matrix with $1$ on all
diagonal entries and $0$ on all off-diagonals
• $x_i$, $[\mathbf{x}]_i$: the $i^\textrm{th}$ element of vector $\mathbf{x}$
• $x_{ij}$, $x_{i,j}$, $[\mathbf{X}]_{ij}$, $[\mathbf{X}]_{i,j}$: the element of matrix $\mathbf{X}$ at row $i$ and column $j$
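To make these objects concrete, here is a minimal sketch that represents them with plain Python lists so that it runs without any library. This is only an illustration under that assumption: in the book itself such objects are framework tensors, and the helper names `identity` and `shape` are ours.

```python
# Plain-Python stand-ins for the numerical objects above (no library needed).
x = 3.0                                  # a scalar
v = [1.0, 2.0, 3.0]                      # a vector with 3 elements
X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]                    # a matrix with 2 rows and 3 columns
T = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]  # a tensor of shape (2, 3, 4)


def identity(n):
    """Return the n x n identity matrix: 1 on the diagonal, 0 elsewhere."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def shape(obj):
    """Return the shape of a non-empty rectangular nested list as a tuple."""
    dims = []
    while isinstance(obj, list):
        dims.append(len(obj))
        obj = obj[0]
    return tuple(dims)


I = identity(3)
print(shape(v), shape(X), shape(T))  # (3,) (2, 3) (2, 3, 4)
# Python indexes from 0, so the first element of v is v[0], and the entry of X
# in the first row and second column is X[0][1].
print(v[0], X[0][1])                 # 1.0 2.0
```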


Set Theory 


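• $\mathcal{X}$: a set
• $\mathbb{Z}$: the set of integers
• $\mathbb{Z}^+$: the set of positive integers
• $\mathbb{R}$: the set of real numbers
• $\mathbb{R}^n$: the set of $n$-dimensional vectors of real numbers
• $\mathbb{R}^{a\times b}$: the set of matrices of real numbers with $a$ rows and $b$ columns
• $|\mathcal{X}|$: cardinality (number of elements) of set $\mathcal{X}$
• $\mathcal{A}\cup\mathcal{B}$: union of sets $\mathcal{A}$ and $\mathcal{B}$
• $\mathcal{A}\cap\mathcal{B}$: intersection of sets $\mathcal{A}$ and $\mathcal{B}$
• $\mathcal{A}\setminus\mathcal{B}$: set subtraction of $\mathcal{B}$ from $\mathcal{A}$ (contains only those elements of $\mathcal{A}$ that do not
belong to $\mathcal{B}$)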
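The set operations above map directly onto Python's built-in `set` type. The following short sketch uses two small example sets of our own choosing:

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}

cardinality = len(A)     # |A|
union = A | B            # A ∪ B
intersection = A & B     # A ∩ B
difference = A - B       # A \ B: elements of A that do not belong to B

print(cardinality)           # 4
print(sorted(union))         # [1, 2, 3, 4, 5]
print(sorted(intersection))  # [3, 4]
print(sorted(difference))    # [1, 2]
```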


Functions and Operators 


• $f(\cdot)$: a function
• $\log(\cdot)$: the natural logarithm (base $e$)
• $\log_2(\cdot)$: logarithm to base $2$
• $\exp(\cdot)$: the exponential function
• $\mathbf{1}(\cdot)$: the indicator function; evaluates to $1$ if the boolean argument is true, and $0$ otherwise
• $\mathbf{1}_{\mathcal{X}}(z)$: the set-membership indicator function; evaluates to $1$ if the element $z$ belongs to
the set $\mathcal{X}$ and $0$ otherwise
• $(\cdot)^\top$: transpose of a vector or a matrix
• $\mathbf{X}^{-1}$: inverse of matrix $\mathbf{X}$
• $\odot$: Hadamard (elementwise) product
• $[\cdot, \cdot]$: concatenation
• $\|\cdot\|_p$: $\ell_p$ norm
• $\|\cdot\|$: $\ell_2$ norm
• $\langle \mathbf{x}, \mathbf{y} \rangle$: inner (dot) product of vectors $\mathbf{x}$ and $\mathbf{y}$
• $\sum$: summation over a collection of elements
• $\prod$: product over a collection of elements
• $\stackrel{\textrm{def}}{=}$: an equality asserted as a definition of the symbol on the left-hand side
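Several of these operators can be written in a few lines of standard-library Python. The sketch below is illustrative only: the function names are ours, vectors are plain lists, and no attempt is made at efficiency.

```python
import math


def indicator(condition):
    """1(.): evaluates to 1 if the boolean argument is true, and 0 otherwise."""
    return 1 if condition else 0


def hadamard(x, y):
    """Elementwise (Hadamard) product of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return [a * b for a, b in zip(x, y)]


def inner(x, y):
    """Inner (dot) product of two equal-length vectors."""
    return sum(hadamard(x, y))


def norm(x, p=2):
    """l_p norm for p >= 1; the default p=2 gives the Euclidean norm."""
    return sum(abs(a) ** p for a in x) ** (1.0 / p)


def transpose(X):
    """Transpose of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*X)]


x = [3.0, -4.0]
y = [1.0, 2.0]
print(indicator(2 > 1), indicator(3 in {1, 2}))  # 1 0
print(hadamard(x, y))                            # [3.0, -8.0]
print(inner(x, y))                               # -5.0
print(norm(x), norm(x, p=1))                     # 5.0 7.0
print(x + y)                                     # concatenation: [3.0, -4.0, 1.0, 2.0]
print(transpose([[1, 2, 3], [4, 5, 6]]))         # [[1, 4], [2, 5], [3, 6]]
print(math.log2(8), math.exp(0))                 # 3.0 1.0
```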
Calculus


• $\frac{dy}{dx}$: derivative of $y$ with respect to $x$
• $\frac{\partial y}{\partial x}$: partial derivative of $y$ with respect to $x$
• $\nabla_{\mathbf{x}} y$: gradient of $y$ with respect to $\mathbf{x}$
• $\int_a^b f(x) \;dx$: definite integral of $f$ from $a$ to $b$ with respect to $x$
• $\int f(x) \;dx$: indefinite integral of $f$ with respect to $x$
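As a rough numerical illustration of this notation, the sketch below approximates a derivative, a gradient, and a definite integral with finite differences and the midpoint rule. These are simple stand-ins chosen by us for illustration; they are not the automatic differentiation used later in the book.

```python
def derivative(f, x, h=1e-5):
    """Central-difference approximation of dy/dx for y = f(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def gradient(f, x, h=1e-5):
    """Approximate the gradient of a scalar-valued f at the vector x (a list)."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        # The i-th entry is the partial derivative with respect to x[i].
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def integral(f, a, b, n=10_000):
    """Midpoint-rule approximation of the definite integral of f from a to b."""
    width = (b - a) / n
    return sum(f(a + (k + 0.5) * width) for k in range(n)) * width


def square(x):
    return x ** 2


def g(x):
    return x[0] ** 2 + 3 * x[1]


print(round(derivative(square, 3.0), 4))                # 6.0
print([round(d, 4) for d in gradient(g, [1.0, 2.0])])   # [2.0, 3.0]
print(round(integral(square, 0.0, 3.0), 4))             # 9.0
```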


Probability and Information Theory 


• $X$: a random variable
• $P$: a probability distribution
• $X \sim P$: the random variable $X$ follows distribution $P$
• $P(X=x)$: the probability assigned to the event where random variable $X$ takes value $x$
• $P(X \mid Y)$: the conditional probability distribution of $X$ given $Y$
• $p(\cdot)$: a probability density function (PDF) associated with distribution $P$
• $E[X]$: expectation of a random variable $X$
• $X \perp Y$: random variables $X$ and $Y$ are independent
• $X \perp Y \mid Z$: random variables $X$ and $Y$ are conditionally independent given $Z$
• $\sigma_X$: standard deviation of random variable $X$
• $\mathrm{Var}(X)$: variance of random variable $X$, equal to $\sigma^2_X$
• $\mathrm{Cov}(X, Y)$: covariance of random variables $X$ and $Y$
• $\rho(X, Y)$: the Pearson correlation coefficient between $X$ and $Y$, equals $\frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$
• $H(X)$: entropy of random variable $X$
• $D_{\mathrm{KL}}(P\|Q)$: the KL-divergence (or relative entropy) from distribution $Q$ to distribution
$P$
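For discrete distributions, most of these quantities reduce to short sums. The following sketch is an illustration under our own assumptions: distributions are given as lists of probabilities over a finite support, entropy and KL-divergence are measured in nats (natural logarithm), and the function names are ours.

```python
import math


def expectation(values, probs):
    """E[X] for a discrete random variable."""
    return sum(v * p for v, p in zip(values, probs))


def variance(values, probs):
    """Var(X) = E[(X - E[X])^2]; the standard deviation is its square root."""
    mean = expectation(values, probs)
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))


def entropy(probs):
    """H(X) in nats; outcomes with probability 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl_divergence(p, q):
    """D_KL(P || Q) in nats, assuming q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pearson(xs, ys):
    """Correlation of paired samples: Cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)


die = [1, 2, 3, 4, 5, 6]
fair = [1 / 6] * 6
print(round(expectation(die, fair), 4))                    # 3.5
print(round(variance(die, fair), 4))                       # 2.9167
print(round(entropy([0.5, 0.5]), 4))                       # 0.6931
print(round(kl_divergence([0.5, 0.5], [0.9, 0.1]), 4))     # 0.5108
print(round(pearson([1, 2, 3], [2, 4, 6]), 6))             # 1.0
```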
1 Introduction


Until recently, nearly every computer program that you might have interacted with during an
ordinary day was coded up as a rigid set of rules specifying precisely how it should behave. 
Say that we wanted to write an application to manage an e-commerce platform. After 
huddling around a whiteboard for a few hours to ponder the problem, we might settle on 
the broad strokes of a working solution, for example: (i) users interact with the application 
through an interface running in a web browser or mobile application; (ii) our application 
interacts with a commercial-grade database engine to keep track of each user’s state and 
maintain records of historical transactions; and (iii) at the heart of our application, the 
business logic (you might say, the brains) of our application spells out a set of rules that 
map every conceivable circumstance to the corresponding action that our program should 
take. 


To build the brains of our application, we might enumerate all the common events that our 
program should handle. For example, whenever a customer clicks to add an item to their 
shopping cart, our program should add an entry to the shopping cart database table, associ- 
ating that user’s ID with the requested product’s ID. We might then attempt to step through 
every possible corner case, testing the appropriateness of our rules and making any neces- 
sary modifications. What happens if a user initiates a purchase with an empty cart? While 
few developers ever get it completely right the first time (it might take some test runs to 
work out the kinks), for the most part we can write such programs and confidently launch 
them before ever seeing a real customer. Our ability to manually design automated sys- 
tems that drive functioning products and systems, often in novel situations, is a remarkable 
cognitive feat. And when you are able to devise solutions that work 100% of the time, you 
typically should not be worrying about machine learning. 
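To make the contrast with machine learning concrete, here is a minimal sketch of such hand-written business logic. Everything in it is hypothetical: the in-memory dictionary stands in for the shopping cart database table, and the names `add_to_cart`, `checkout`, and `EmptyCartError` are ours. Note that every behavior, including the empty-cart corner case, is spelled out by the programmer in advance.

```python
class EmptyCartError(Exception):
    """Raised when a purchase is attempted with nothing in the cart."""


# user_id -> {product_id: quantity}; an in-memory stand-in for the cart table.
carts = {}


def add_to_cart(user_id, product_id):
    """Rule: clicking 'add' associates the user's ID with the product's ID."""
    cart = carts.setdefault(user_id, {})
    cart[product_id] = cart.get(product_id, 0) + 1


def checkout(user_id):
    """Rule: a purchase with an empty cart is rejected; otherwise place the order."""
    cart = carts.get(user_id)
    if not cart:
        raise EmptyCartError(f"user {user_id} has an empty cart")
    order = dict(cart)
    cart.clear()
    return order


add_to_cart("user-1", "product-9")
add_to_cart("user-1", "product-9")
print(checkout("user-1"))  # {'product-9': 2}
```

No amount of usage changes how this program behaves; only a developer editing the rules does.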


Fortunately for the growing community of machine learning scientists, many tasks that we 
would like to automate do not bend so easily to human ingenuity. Imagine huddling around 
the whiteboard with the smartest minds you know, but this time you are tackling one of the 
following problems: 


• Write a program that predicts tomorrow’s weather given geographic information, satellite
images, and a trailing window of past weather.


• Write a program that takes in a factoid question, expressed in free-form text, and answers
it correctly.


• Write a program that, given an image, identifies every person depicted in it and draws
outlines around each.


• Write a program that presents users with products that they are likely to enjoy but unlikely,
in the natural course of browsing, to encounter.


For these problems, even elite programmers would struggle to code up solutions from 
scratch. The reasons can vary. Sometimes the program that we are looking for follows 
a pattern that changes over time, so there is no fixed right answer! In such cases, any 
successful solution must adapt gracefully to a changing world. At other times, the relationship
(say, between pixels and abstract categories) may be too complicated, requiring
thousands or millions of computations and following unknown principles. In the case of
image recognition, the precise steps required to perform the task lie beyond our conscious 
understanding, even though our subconscious cognitive processes execute the task effort- 
lessly. 


Machine learning is the study of algorithms that can learn from experience. As a machine 
learning algorithm accumulates more experience, typically in the form of observational 
data or interactions with an environment, its performance improves. Contrast this with 
our deterministic e-commerce platform, which follows the same business logic, no matter 
how much experience accrues, until the developers themselves learn and decide that it is 
time to update the software. In this book, we will teach you the fundamentals of machine 
learning, focusing in particular on deep learning, a powerful set of techniques driving in- 
novations in areas as diverse as computer vision, natural language processing, healthcare, 
and genomics. 


1.1 A Motivating Example 


Before beginning writing, the authors of this book, like much of the workforce, had to
become caffeinated. We hopped in the car and started driving. Using an iPhone, Alex called 
out “Hey Siri”, awakening the phone’s voice recognition system. Then Mu commanded 
“directions to Blue Bottle coffee shop”. The phone quickly displayed the transcription of 
his command. It also recognized that we were asking for directions and launched the Maps 
application (app) to fulfill our request. Once launched, the Maps app identified a number 
of routes. Next to each route, the phone displayed a predicted transit time. While this story 
was fabricated for pedagogical convenience, it demonstrates that in the span of just a few 
seconds, our everyday interactions with a smartphone can engage several machine learning
models. 


Imagine just writing a program to respond to a wake word such as “Alexa”, “OK Google”, 
and “Hey Siri”. Try coding it up in a room by yourself with nothing but a computer and 
a code editor, as illustrated in Fig. 1.1.1. How would you write such a program from first 
principles? Think about it... the problem is hard. Every second, the microphone will col- 
lect roughly 44,000 samples. Each sample is a measurement of the amplitude of the sound 
wave. What rule could map reliably from a snippet of raw audio to confident predictions 
{yes, no} about whether the snippet contains the wake word? If you are stuck, do not worry.
We do not know how to write such a program from scratch either. That is why we use machine
learning.
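To get a feel for why hand-written rules fall short here, consider the following sketch. It is purely illustrative: the synthetic tones stand in for recorded audio, the 44,000 samples-per-second figure is the rough rate quoted above, and `loudness_rule` is a deliberately naive rule of our own that answers “yes” whenever a snippet is loud enough.

```python
import math

SAMPLE_RATE = 44_000  # samples per second, the rough figure quoted above


def tone(frequency_hz, seconds, amplitude=0.5):
    """Synthesize a pure tone as a list of amplitude measurements."""
    n = int(SAMPLE_RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * frequency_hz * t / SAMPLE_RATE)
            for t in range(n)]


def loudness_rule(samples, threshold=0.1):
    """Naive hand-written rule: 'yes' whenever the mean energy is above threshold."""
    energy = sum(s * s for s in samples) / len(samples)
    return "yes" if energy > threshold else "no"


snippet_a = tone(220, 1.0)       # one "sound"
snippet_b = tone(880, 1.0)       # a different "sound"
silence = [0.0] * SAMPLE_RATE

print(len(snippet_a))            # 44000
print(loudness_rule(snippet_a), loudness_rule(snippet_b), loudness_rule(silence))
# yes yes no
```

The rule separates sound from silence, but it fires on both tones alike. Nothing in it distinguishes one word from another, and it is far from obvious how to extend it by hand across 44,000 numbers per second.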


[Figure 1.1.1: Identify a wake word. An audio snippet is fed to a wake word model, which outputs one of {yes, no}.]


Here is the trick. Often, even when we do not know how to tell a computer explicitly how 
to map from inputs to outputs, we are nonetheless capable of performing the cognitive feat 
ourselves. In other words, even if you do not know how to program a computer to rec- 
ognize the word “Alexa”, you yourself are able to recognize it. Armed with this ability, 
we can collect a huge dataset containing examples of audio snippets and associated labels, 
indicating which snippets contain the wake word. In the currently dominant approach to 
machine learning, we do not attempt to design a system explicitly to recognize wake words. 
Instead, we define a flexible program whose behavior is determined by a number of pa- 
rameters. Then we use the dataset to determine the best possible parameter values, i.e., 
those that improve the performance of our program with respect to a chosen performance 
measure. 


You can think of the parameters as knobs that we can turn, manipulating the behavior of 
the program. Once the parameters are fixed, we call the program a model. The set of all 
distinct programs (input-output mappings) that we can produce just by manipulating the 
parameters is called a family of models. And the “meta-program” that uses our dataset to 
choose the parameters is called a learning algorithm. 


Before we can go ahead and engage the learning algorithm, we have to define the problem 
precisely, pinning down the exact nature of the inputs and outputs, and choosing an ap- 
propriate model family. In this case, our model receives a snippet of audio as input, and 
the model generates a selection among {yes, no} as output. If all goes according to plan,
the model’s guesses will typically be correct as to whether the snippet contains the wake
word.


If we choose the right family of models, there should exist one setting of the knobs such 
that the model fires “yes” every time it hears the word “Alexa”. Because the exact choice of 
the wake word is arbitrary, we will probably need a model family sufficiently rich that, via 
another setting of the knobs, it could fire “yes” only upon hearing the word “Apricot”. We 
expect that the same model family should be suitable for “Alexa” recognition and “Apricot” 
recognition because they seem, intuitively, to be similar tasks. However, we might need a 
different family of models entirely if we want to deal with fundamentally different inputs 
or outputs, say if we wanted to map from images to captions, or from English sentences to 
Chinese sentences. 


As you might guess, if we just set all of the knobs randomly, it is unlikely that our model 
will recognize “Alexa”, “Apricot”, or any other English word. In machine learning, the 
learning is the process by which we discover the right setting of the knobs for coercing the 
desired behavior from our model. In other words, we train our model with data. As shown
in Fig. 1.1.2, the training process usually looks like the following: 


1. Start off with a randomly initialized model that cannot do anything useful.
2. Grab some of your data (e.g., audio snippets and corresponding {yes, no} labels).
3. Tweak the knobs to make the model perform better as assessed on those examples.
4. Repeat Steps 2 and 3 until the model is awesome.


Fig. 1.1.2: A typical training process (design a model, grab new data, update the model).
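
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the wake word system itself: the data is a synthetic one-dimensional stand-in for audio features, the model is a hypothetical logistic model with two knobs (`w` and `b`), and the names `train` and `predict_proba` are our own.

```python
import math
import random

def predict_proba(w, b, x):
    """Probability of "yes" from a model with two knobs, w and b."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def train(data, steps=2000, batch_size=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Step 1: start with randomly initialized knobs.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(steps):
        # Step 2: grab some of the data.
        batch = rng.sample(data, batch_size)
        # Step 3: tweak the knobs to do better on those examples
        # (gradient of the average log loss of the logistic model).
        grad_w = sum((predict_proba(w, b, x) - y) * x for x, y in batch) / batch_size
        grad_b = sum(predict_proba(w, b, x) - y for x, y in batch) / batch_size
        w -= lr * grad_w
        b -= lr * grad_b
        # Step 4: repeat.
    return w, b

# Synthetic stand-in data: the label is 1 ("yes") when the feature is positive.
data_rng = random.Random(1)
data = [(x, 1 if x > 0 else 0) for x in (data_rng.uniform(-3, 3) for _ in range(200))]

w, b = train(data)
accuracy = sum((predict_proba(w, b, x) > 0.5) == (y == 1) for x, y in data) / len(data)
```

Here "until the model is awesome" is replaced by a fixed number of steps, which is a simplification; in practice one would monitor performance and stop when it is good enough.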


To summarize, rather than code up a wake word recognizer, we code up a program that can 
learn to recognize wake words, if presented with a large labeled dataset. You can think of 
this act of determining a program’s behavior by presenting it with a dataset as programming 
with data. That is to say, we can “program” a cat detector by providing our machine learning 
system with many examples of cats and dogs. This way the detector will eventually learn 
to emit a very large positive number if it is a cat, a very large negative number if it is a 
dog, and something closer to zero if it is not sure. This barely scratches the surface of what 
machine learning can do. Deep learning, which we will explain in greater detail later, is 
just one among many popular methods for solving machine learning problems. 


1.2 Key Components 


In our wake word example, we described a dataset consisting of audio snippets and binary 
labels, and we gave a hand-wavy sense of how we might train a model to approximate a 
mapping from snippets to classifications. This sort of problem, where we try to predict a 
designated unknown label based on known inputs given a dataset consisting of examples 
for which the labels are known, is called supervised learning. This is just one among many 
kinds of machine learning problems. Before we explore other varieties, we would like to 
shed more light on some core components that will follow us around, no matter what kind 
of machine learning problem we tackle: 


1. The data that we can learn from.
2. A model of how to transform the data.
3. An objective function that quantifies how well (or badly) the model is doing.
4. An algorithm to adjust the model’s parameters to optimize the objective function.




1.2.1 Data 


It might go without saying that you cannot do data science without data. We could lose 
hundreds of pages pondering what precisely data is, but for now, we will focus on the key 
properties of the datasets that we will be concerned with. Generally, we are concerned with 
a collection of examples. In order to work with data usefully, we typically need to come 
up with a suitable numerical representation. Each example (or data point, data instance, 
sample) typically consists of a set of attributes called features (sometimes called covariates 
or inputs), based on which the model must make its predictions. In supervised learning 
problems, our goal is to predict the value of a special attribute, called the label (or target), 
that is not part of the model’s input. 


If we were working with image data, each example might consist of an individual photo- 
graph (the features) and a number indicating the category to which the photograph belongs 
(the label). The photograph would be represented numerically as three grids of numerical 
values representing the brightness of red, green, and blue light at each pixel location. For 
example, a $200 \times 200$ pixel color photograph would consist of $200 \times 200 \times 3 = 120000$
numerical values. 
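
As a quick check of this count, the sketch below builds such a photograph as three nested grids in plain Python. The all-zero (black) image and the channels-first layout are assumptions made only for illustration.

```python
height, width, channels = 200, 200, 3

# One grid of brightness values per color channel (all zeros, i.e., a black image).
image = [[[0 for _ in range(width)] for _ in range(height)] for _ in range(channels)]

# Count every numerical value stored across the three grids.
num_values = sum(len(row) for grid in image for row in grid)
```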


Alternatively, we might work with electronic health record data and tackle the task of pre- 
dicting the likelihood that a given patient will survive the next 30 days. Here, our features 
might consist of a collection of readily available attributes and frequently recorded mea- 
surements, including age, vital signs, comorbidities, current medications, and recent pro- 
cedures. The label available for training would be a binary value indicating whether each 
patient in the historical data survived within the 30-day window. 


In such cases, when every example is characterized by the same number of numerical fea- 
tures, we say that the inputs are fixed-length vectors and we call the (constant) length of 
the vectors the dimensionality of the data. As you might imagine, fixed-length inputs can 
be convenient, giving us one less complication to worry about. However, not all data can 
easily be represented as fixed-length vectors. While we might expect microscope images to 
come from standard equipment, we cannot expect images mined from the Internet all to have 
the same resolution or shape. For images, we might consider cropping them to a standard 
size, but that strategy only gets us so far. We risk losing information in the cropped-out 
portions. Moreover, text data resists fixed-length representations even more stubbornly. 
Consider the customer reviews left on e-commerce sites such as Amazon, IMDb, and Tri- 
pAdvisor. Some are short: “it stinks!”. Others ramble for pages. One major advantage of 
deep learning over traditional methods is the comparative grace with which modern models 
can handle varying-length data. 


Generally, the more data we have, the easier our job becomes. When we have more data, we 
can train more powerful models and rely less heavily on preconceived assumptions. The 
regime change from (comparatively) small to big data is a major contributor to the success 
of modern deep learning. To drive the point home, many of the most exciting models in 
deep learning do not work without large datasets. Some others might work in the small 
data regime, but are no better than traditional approaches. 


Finally, it is not enough to have lots of data and to process it cleverly. We need the right 
data. If the data is full of mistakes, or if the chosen features are not predictive of the target
quantity of interest, learning is going to fail. The situation is captured well by the cliché: 
garbage in, garbage out. Moreover, poor predictive performance is not the only poten- 
tial consequence. In sensitive applications of machine learning, like predictive policing, 
resume screening, and risk models used for lending, we must be especially alert to the con- 
sequences of garbage data. One commonly occurring failure mode concerns datasets where 
some groups of people are unrepresented in the training data. Imagine applying a skin cancer
recognition system that had never seen black skin before. Failure can also occur when
the data not only under-represents some groups but also reflects societal prejudices. For example,
if past hiring decisions are used to train a predictive model that will be used to screen
resumes, then machine learning models could inadvertently capture and automate historical
injustices. Note that this can all happen without the data scientist actively conspiring, or 
even being aware. 


1.2.2 Models 


Most machine learning involves transforming the data in some sense. We might want to 
build a system that ingests photos and predicts smiley-ness. Alternatively, we might want to 
ingest a set of sensor readings and predict how normal vs. anomalous the readings are. By 
model, we denote the computational machinery for ingesting data of one type, and spitting 
out predictions of a possibly different type. In particular, we are interested in statistical 
models that can be estimated from data. While simple models are perfectly capable of ad- 
dressing appropriately simple problems, the problems that we focus on in this book stretch 
the limits of classical methods. Deep learning is differentiated from classical approaches 
principally by the set of powerful models that it focuses on. These models consist of many 
successive transformations of the data that are chained together top to bottom, thus the 
name deep learning. On our way to discussing deep models, we will also discuss some 
more traditional methods. 


1.2.3 Objective Functions 


Earlier, we introduced machine learning as learning from experience. By learning here, we 
mean improving at some task over time. But who is to say what constitutes an improvement?
You might imagine that we could propose updating our model, and some people might 
disagree on whether our proposal constituted an improvement or not. 


In order to develop a formal mathematical system of learning machines, we need to have 
formal measures of how good (or bad) our models are. In machine learning, and optimiza- 
tion more generally, we call these objective functions. By convention, we usually define 
objective functions so that lower is better. This is merely a convention. You can take any 
function for which higher is better, and turn it into a new function that is qualitatively iden- 
tical but for which lower is better by flipping the sign. Because we choose lower to be 
better, these functions are sometimes called loss functions. 


When trying to predict numerical values, the most common loss function is squared error, 
i.e., the square of the difference between the prediction and the ground truth target. For 
classification, the most common objective is to minimize error rate, i.e., the fraction of 
examples on which our predictions disagree with the ground truth. Some objectives (e.g.,
squared error) are easy to optimize, while others (e.g., error rate) are difficult to optimize 
directly, owing to non-differentiability or other complications. In these cases, it is common 
instead to optimize a surrogate objective. 
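
The two objectives named above can be written down directly. The following is a minimal sketch with made-up predictions and targets; the function names are ours.

```python
def squared_error(prediction, target):
    """Square of the difference between a prediction and its ground truth target."""
    return (prediction - target) ** 2

def error_rate(predicted_labels, true_labels):
    """Fraction of examples on which the predictions disagree with the ground truth."""
    mistakes = sum(p != t for p, t in zip(predicted_labels, true_labels))
    return mistakes / len(true_labels)

# Made-up numerical predictions and targets, averaged into a mean squared error.
predictions, targets = [2.5, 0.0, 2.0], [3.0, -0.5, 2.0]
mse = sum(squared_error(p, t) for p, t in zip(predictions, targets)) / len(targets)

# Made-up class predictions: one of the four disagrees with the ground truth.
err = error_rate(["cat", "dog", "dog", "cat"], ["cat", "dog", "cat", "cat"])
```

Note that `squared_error` changes smoothly as a prediction moves, whereas `error_rate` only changes in jumps when a predicted label flips, which is one way to see why it is hard to optimize directly.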


During optimization, we think of the loss as a function of the model’s parameters, and treat 
the training dataset as a constant. We learn the best values of our model’s parameters by 
minimizing the loss incurred on a set consisting of some number of examples collected for 
training. However, doing well on the training data does not guarantee that we will do well 
on unseen data. So we will typically want to split the available data into two partitions: 
the training dataset (or training set), for learning model parameters; and the test dataset 
(or test set), which is held out for evaluation. At the end of the day, we typically report 
how our models perform on both partitions. You could think of training performance as 
analogous to the scores that a student achieves on the practice exams used to prepare for 
some real final exam. Even if the results are encouraging, that does not guarantee success 
on the final exam. Over the course of studying, the student might begin to memorize the 
practice questions, appearing to master the topic but faltering when faced with previously 
unseen questions on the actual final exam. When a model performs well on the training set 
but fails to generalize to unseen data, we say that it is overfitting to the training data. 
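
The student analogy can be made concrete with a deliberately extreme sketch: a "model" that simply memorizes its training examples. The data, the 80/20 split, and the fallback guess are all assumptions chosen for illustration.

```python
import random

rng = random.Random(0)
# Hypothetical data: the label is 1 when the feature exceeds 0.5.
examples = [(x, int(x > 0.5)) for x in (rng.random() for _ in range(100))]
rng.shuffle(examples)
train_set, test_set = examples[:80], examples[80:]   # hold out 20 examples for evaluation

# A "model" that memorizes the practice questions: a lookup table
# with a fixed fallback guess of 0 for anything it has not seen.
memory = dict(train_set)

def memorizer(x):
    return memory.get(x, 0)

def accuracy(model, dataset):
    return sum(model(x) == y for x, y in dataset) / len(dataset)

train_acc = accuracy(memorizer, train_set)   # every training example is looked up exactly
test_acc = accuracy(memorizer, test_set)     # unseen inputs fall back to the fixed guess
```

Comparing `train_acc` with `test_acc` is exactly the comparison between practice exam scores and the final exam: the gap between them is the symptom of overfitting.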


1.2.4 Optimization Algorithms 


Once we have got some data source and representation, a model, and a well-defined objec- 
tive function, we need an algorithm capable of searching for the best possible parameters 
for minimizing the loss function. Popular optimization algorithms for deep learning are 
based on an approach called gradient descent. In brief, at each step, this method checks 
to see, for each parameter, how that training set loss would change if you perturbed that 
parameter by just a small amount. It would then update the parameter in the direction that 
lowers the loss. 
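
The description above can be turned into code almost literally. The sketch below perturbs each parameter by a small amount, observes how the training loss changes, and moves the parameter in the direction that lowers the loss. The linear model, the tiny dataset, and the step sizes are assumptions; practical deep learning libraries compute gradients analytically rather than by perturbation.

```python
def loss(params, dataset):
    """Mean squared error of the line w * x + b on the training set."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in dataset) / len(dataset)

def gradient_descent_step(params, dataset, lr=0.05, eps=1e-6):
    base = loss(params, dataset)
    new_params = []
    for i, p in enumerate(params):
        nudged = list(params)
        nudged[i] = p + eps                            # perturb one parameter slightly
        slope = (loss(nudged, dataset) - base) / eps   # how the loss responds
        new_params.append(p - lr * slope)              # step in the direction that lowers the loss
    return new_params

dataset = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]   # generated by y = 2x + 1
params = [0.0, 0.0]
for _ in range(5000):
    params = gradient_descent_step(params, dataset)
```

After enough steps, `params` should end up close to the values `[2, 1]` that generated the data.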


1.3 Kinds of Machine Learning Problems 


The wake word problem in our motivating example is just one among many that machine 
learning can tackle. To motivate the reader further and provide us with some common 
language that will follow us throughout the book, we now provide a broad overview of the 
landscape of machine learning problems. 


1.3.1 Supervised Learning 


Supervised learning describes tasks where we are given a dataset containing both features 
and labels and asked to produce a model that predicts the labels when given input features. 
Each feature-label pair is called an example. Sometimes, when the context is clear, we
may use the term examples to refer to a collection of inputs, even when the corresponding
labels are unknown. The supervision comes into play because, for choosing the parameters,
we (the supervisors) provide the model with a dataset consisting of labeled examples.
In probabilistic terms, we typically are interested in estimating the conditional probability 
of a label given input features. While it is just one among several paradigms, supervised 
learning accounts for the majority of successful applications of machine learning in indus- 
try. Partly that is because many important tasks can be described crisply as estimating the 
probability of something unknown given a particular set of available data: 


• Predict cancer vs. not cancer, given a computed tomography image.
• Predict the correct translation in French, given a sentence in English.
• Predict the price of a stock next month based on this month’s financial reporting data.


While all supervised learning problems are captured by the simple description “predicting 
the labels given input features”, supervised learning itself can take diverse forms and require 
tons of modeling decisions, depending on (among other considerations) the type, size, and 
quantity of the inputs and outputs. For example, we use different models for processing 
sequences of arbitrary lengths and fixed-length vector representations. We will visit many 
of these problems in depth throughout this book. 


Informally, the learning process looks something like the following. First, grab a big col- 
lection of examples for which the features are known and select from them a random subset, 
acquiring the ground truth labels for each. Sometimes these labels might be available data 
that have already been collected (e.g., did a patient die within the following year?) and 
other times we might need to employ human annotators to label the data (e.g., assigning
images to categories). Together, these inputs and corresponding labels comprise the train- 
ing set. We feed the training dataset into a supervised learning algorithm, a function that 
takes as input a dataset and outputs another function: the learned model. Finally, we can 
feed previously unseen inputs to the learned model, using its outputs as predictions of the 
corresponding label. The full process is drawn in Fig. 1.3.1. 

Fig. 1.3.1: Supervised learning.


Regression 


Perhaps the simplest supervised learning task to wrap your head around is regression. Con- 
sider, for example, a set of data harvested from a database of home sales. We might con- 
struct a table, in which each row corresponds to a different house, and each column cor- 
responds to some relevant attribute, such as the square footage of a house, the number of 
bedrooms, the number of bathrooms, and the number of minutes (walking) to the center 
of town. In this dataset, each example would be a specific house, and the corresponding 
feature vector would be one row in the table. If you live in New York or San Francisco, and
you are not the CEO of Amazon, Google, Microsoft, or Facebook, the (sq. footage, no. of 
bedrooms, no. of bathrooms, walking distance) feature vector for your home might look 
something like: [600, 1, 1, 60]. However, if you live in Pittsburgh, it might look more like
[3000, 4, 3, 10]. Fixed-length feature vectors like this are essential for most classic machine 
learning algorithms. 


What makes a problem a regression is actually the form of the target. Say that you are in the 
market for a new home. You might want to estimate the fair market value of a house, given
some features such as above. The data here might consist of historical home listings and the 
labels might be the observed sales prices. When labels take on arbitrary numerical values 
(even within some interval), we call this a regression problem. The goal is to produce a 
model whose predictions closely approximate the actual label values. 


Lots of practical problems are easily described as regression problems. Predicting the rating
that a user will assign to a movie can be thought of as a regression problem, and if you
designed a great algorithm to accomplish this feat in 2009, you might have won the
1-million-dollar Netflix prize. Predicting the length of stay for patients in the hospital is
also a regression problem. A good rule of thumb is that any how much? or how many?
problem is likely to be regression. For example:


• How many hours will this surgery take?
• How much rainfall will this town have in the next six hours?


Even if you have never worked with machine learning before, you have probably worked 
through a regression problem informally. Imagine, for example, that you had your drains re- 
paired and that your contractor spent 3 hours removing gunk from your sewage pipes. Then 
they sent you a bill of 350 dollars. Now imagine that your friend hired the same contractor 
for 2 hours and received a bill of 250 dollars. If someone then asked you how much to
expect on their upcoming gunk-removal invoice, you might make some reasonable assumptions,
such as more hours worked costs more dollars. You might also assume that there is
some base charge and that the contractor then charges per hour. If these assumptions held 
true, then given these two data examples, you could already identify the contractor’s pricing 
structure: 100 dollars per hour plus 50 dollars to show up at your house. If you followed 
that much, then you already understand the high-level idea behind linear regression. 
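
The reasoning in this example amounts to fitting a line through two points. A minimal sketch, using only the two invoices described above (the variable and function names are ours):

```python
# (hours worked, bill in dollars) for the two invoices described above
(h1, bill1), (h2, bill2) = (3, 350), (2, 250)

hourly_rate = (bill1 - bill2) / (h1 - h2)   # slope of the line through both points
base_charge = bill1 - hourly_rate * h1      # intercept: the charge for showing up

def predict_bill(hours):
    """Predicted invoice under the assumed 'base charge plus hourly rate' structure."""
    return base_charge + hourly_rate * hours
```

With these two parameters in hand, `predict_bill` answers the question about the upcoming invoice for any number of hours, provided the pricing assumptions hold.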


In this case, we could produce the parameters that exactly matched the contractor’s prices. 
Sometimes this is not possible, e.g., if some of the variation arises from factors beyond 
your two features. In these cases, we will try to learn models that minimize the distance 
between our predictions and the observed values. In most of our chapters, we will focus on 
minimizing the squared error loss function. As we will see later, this loss corresponds to 
the assumption that our data were corrupted by Gaussian noise. 
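
As a brief sketch of why this correspondence holds, suppose (as an assumption, in our own notation) that each label is the output of some function $f$ plus Gaussian noise with variance $\sigma^2$. Taking the negative logarithm of the resulting likelihood gives:

```latex
% Assumed noise model: y = f(x) + \epsilon, with \epsilon \sim \mathcal{N}(0, \sigma^2)
p(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - f(x))^2}{2\sigma^2}\right)
\quad\Longrightarrow\quad
-\log p(y \mid x) = \frac{1}{2}\log(2\pi\sigma^2) + \frac{(y - f(x))^2}{2\sigma^2}
```

The first term does not depend on $f$, so choosing $f$ to maximize the likelihood of the observed labels is the same as minimizing the squared error $(y - f(x))^2$.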


Classification 


While regression models are great for addressing how many? questions, lots of problems do 
not fit comfortably in this template. Consider, for example, a bank that wants to develop a 
check scanning feature for its mobile app. Ideally, the customer would simply snap a photo
of a check and the app would automatically recognize the text from the image. Assuming 
that we had some ability to segment out image patches corresponding to each handwritten 
character, then the primary remaining task would be to determine which character among 
some known set is depicted in each image patch. These kinds of which one? problems 
are called classification and require a different set of tools from those used for regression, 
although many techniques will carry over. 


In classification, we want our model to look at features, e.g., the pixel values in an image,
and then predict the category (sometimes called a class), among some discrete set
of options, to which an example belongs. For handwritten digits, we might have ten classes,
corresponding to the digits 0 through 9. The simplest form of classification is when there are
only two classes, a problem which we call binary classification. For example, our dataset 
could consist of images of animals and our labels might be the classes {cat, dog}. Whereas 
in regression we sought a regressor to output a numerical value, in classification we seek a 
classifier, whose output is the predicted class assignment. 


For reasons that we will get into as the book gets more technical, it can be difficult to opti- 
mize a model that can only output a firm categorical assignment, e.g., either “cat” or “dog”. 
In these cases, it is usually much easier to express our model in the language of probabili- 
ties. Given features of an example, our model assigns a probability to each possible class. 
Returning to our animal classification example where the classes are {cat, dog}, a classi- 
fier might see an image and output the probability that the image is a cat as 0.9. We can 
interpret this number by saying that the classifier is 90% sure that the image depicts a cat. 
The magnitude of the probability for the predicted class conveys a notion of uncertainty. 
It is not the only one available and we will discuss others in chapters dealing with more 
advanced topics. 


When we have more than two possible classes, we call the problem multiclass classification.
Common examples include handwritten character recognition {0, 1, 2, ... 9, a, b, c, ...}. While
we attacked regression problems by trying to minimize the squared error loss function, the
common loss function for classification problems is called cross-entropy, whose name will
be demystified when we introduce information theory in later chapters.
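
To give a rough feel for these ideas, the sketch below turns raw scores into class probabilities and evaluates the cross-entropy loss for one example. The softmax function used here is one common way of producing probabilities and is our assumption at this point; the scores and class names are made up.

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probs[true_class])

classes = ["cat", "dog", "rooster"]
probs = softmax([2.0, 0.5, -1.0])   # made-up scores for one image

loss_if_cat = cross_entropy(probs, classes.index("cat"))
loss_if_rooster = cross_entropy(probs, classes.index("rooster"))
```

The loss is small when the model puts high probability on the true class and grows as that probability shrinks, so `loss_if_cat` is smaller than `loss_if_rooster` here.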


Note that the most likely class is not necessarily the one that you are going to use for your 
decision. Assume that you find a beautiful mushroom in your backyard as shown in Fig. 
1.3.2. 


Now, assume that you built a classifier and trained it to predict whether a mushroom is poi- 
sonous based on a photograph. Say our poison-detection classifier outputs that the proba- 
bility that Fig. 1.3.2 shows a death cap is 0.2. In other words, the classifier is 80% sure that 
our mushroom is not a death cap. Still, you would have to be a fool to eat it. That is because 
the certain benefit of a delicious dinner is not worth a 20% risk of dying from it. In other 
words, the effect of the uncertain risk outweighs the benefit by far. Thus, in order to make 
a decision about whether to eat the mushroom, we need to compute the expected detriment 
associated with each action which depends both on the likely outcomes and the benefits or 
harms associated with each. In this case, the detriment incurred by eating the mushroom
might be $0.2 \times \infty + 0.8 \times 0 = \infty$, whereas the loss of discarding it is $0.2 \times 0 + 0.8 \times 1 = 0.8$.
Our caution was justified: as any mycologist would tell us, the mushroom in Fig. 1.3.2 is
actually a death cap.

Fig. 1.3.2: Death cap - do not eat!
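
The same expected-loss calculation can be sketched in code. The loss values mirror the numbers in the example above; the dictionary layout and the names are ours.

```python
import math

p_death_cap = 0.2
probs = {"death cap": p_death_cap, "edible": 1 - p_death_cap}

# loss[action][outcome]: infinite loss for eating a death cap,
# a loss of 1 for throwing away a perfectly good dinner.
loss = {
    "eat":     {"death cap": math.inf, "edible": 0.0},
    "discard": {"death cap": 0.0,      "edible": 1.0},
}

# Weight each outcome's loss by its probability and add them up per action.
expected_loss = {
    action: sum(probs[outcome] * loss[action][outcome] for outcome in probs)
    for action in loss
}
best_action = min(expected_loss, key=expected_loss.get)
```

The decision is made by comparing expected losses across actions, not by picking the most likely class.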


Classification can get much more complicated than just binary or multiclass classification. 
For instance, there are some variants of classification addressing hierarchically structured 
classes. In such cases not all errors are equal—if we must err, we might prefer to misclassify 
to a related class rather than a distant class. Usually, this is referred to as hierarchical 
classification. For inspiration, you might think of Linnaeus, who organized fauna in a
hierarchy. 


In the case of animal classification, it might not be so bad to mistake a poodle for a schnauzer, 
but our model would pay a huge penalty if it confused a poodle with a dinosaur. Which 
hierarchy is relevant might depend on how you plan to use the model. For example, rat- 
tlesnakes and garter snakes might be close on the phylogenetic tree, but mistaking a rattler 
for a garter could have fatal consequences. 
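
One simple way to encode "not all errors are equal" is to penalize a mistake by how far apart the two classes sit in the hierarchy. The toy hierarchy below is hypothetical and only meant to illustrate the idea; as noted above, the right hierarchy depends on how the model will be used.

```python
# Hypothetical toy hierarchy: each class maps to its path from the root.
paths = {
    "poodle":    ["animal", "mammal", "dog", "poodle"],
    "schnauzer": ["animal", "mammal", "dog", "schnauzer"],
    "dinosaur":  ["animal", "reptile", "dinosaur"],
}

def tree_distance(a, b):
    """Number of edges between two classes in the hierarchy."""
    shared = 0
    for x, y in zip(paths[a], paths[b]):
        if x != y:
            break
        shared += 1   # count the common ancestors along both paths
    return (len(paths[a]) - shared) + (len(paths[b]) - shared)

small_penalty = tree_distance("poodle", "schnauzer")
large_penalty = tree_distance("poodle", "dinosaur")
```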


Tagging 


Some classification problems fit neatly into the binary or multiclass classification setups. 
For example, we could train a normal binary classifier to distinguish cats from dogs. Given 
the current state of computer vision, we can do this easily, with off-the-shelf tools. Nonethe- 
less, no matter how accurate our model gets, we might find ourselves in trouble when the 
classifier encounters an image of the Town Musicians of Bremen, a popular German fairy 
tale featuring four animals (Fig. 1.3.3). 


As you can see, the photo features a cat, a rooster, a dog, and a donkey, with some trees in 
the background. If we anticipate encountering such images, multiclass classification might 
not be the right problem formulation. Instead, we might want to give the model the option 
of saying the image depicts a cat, a dog, a donkey, and a rooster. 

Fig. 1.3.3: A donkey, a dog, a cat, and a rooster.


The problem of learning to predict classes that are not mutually exclusive is called multi- 
label classification. Auto-tagging problems are typically best described in terms of multi- 
label classification. Think of the tags people might apply to posts on a technical blog, e.g., 
“machine learning”, “technology”, “gadgets”, “programming languages”, “Linux”, “cloud 
computing”, “AWS”. A typical article might have 5-10 tags applied. Typically, tags will 
exhibit some correlation structure. Posts about “cloud computing” are likely to mention
“AWS” and posts about “machine learning” are likely to mention “GPUs”.
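
A common way to let a model say "several of these at once" is to score each tag independently and keep every tag whose probability clears a threshold. The sketch below assumes made-up per-tag scores and a threshold of 0.5; it ignores the correlation structure between tags mentioned above.

```python
import math

def sigmoid(score):
    """Squash a real-valued score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))

def predict_tags(scores, threshold=0.5):
    """Keep every tag whose independent probability clears the threshold."""
    return [tag for tag, s in scores.items() if sigmoid(s) >= threshold]

# Hypothetical per-tag scores that a model might produce for one blog post.
scores = {"machine learning": 2.1, "cloud computing": 0.3, "Linux": -1.7, "AWS": 1.2}
tags = predict_tags(scores)
```

Unlike in multiclass classification, the per-tag probabilities need not sum to one, so any number of tags can be selected.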


Sometimes such tagging problems draw on enormous label sets. The National Library of 
Medicine employs many professional annotators who associate each article to be indexed in 
PubMed with a set of tags drawn from the Medical Subject Headings (MeSH) ontology, a 
collection of roughly 28,000 tags. Correctly tagging articles is important because it allows 
researchers to conduct exhaustive reviews of the literature. This is a time-consuming pro- 
cess and typically there is a one-year lag between archiving and tagging. Machine learning 
can provide provisional tags until each article has a proper manual review. Indeed, for 
several years, the BioASQ organization has hosted competitions for this task.


Search


In the field of information retrieval, we often impose ranks on sets of items. Take web 
search for example. The goal is less to determine whether a particular page is relevant for a
query than to determine which, among a set of relevant results, should be shown most prominently
to a particular user. One way of doing this might be to first assign a score to every element
in the set and then to retrieve the top-rated elements. PageRank, the original secret
sauce behind the Google search engine, was an early example of such a scoring system.
Weirdly, the scoring provided by PageRank did not depend on the actual query. Instead,
the search engine relied on a simple relevance filter to identify the set of relevant candidates and then
used PageRank to prioritize the more authoritative pages. Nowadays, search engines use
machine learning and behavioral models to obtain query-dependent relevance scores. There 
are entire academic conferences devoted to this subject. 
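
The two-stage scheme described above (filter by relevance, then order by a query-independent score) can be sketched as follows. The pages, the `authority` values standing in for PageRank scores, and the word-matching filter are all hypothetical.

```python
# Hypothetical pages with a query-independent authority score (a stand-in for PageRank).
pages = [
    {"url": "a.example", "text": "deep learning tutorial", "authority": 0.2},
    {"url": "b.example", "text": "learning to cook", "authority": 0.9},
    {"url": "c.example", "text": "deep learning book", "authority": 0.7},
]

def search(query, pages):
    terms = query.lower().split()
    # Stage 1: a simple relevance filter keeps pages containing every query term.
    candidates = [p for p in pages if all(t in p["text"].split() for t in terms)]
    # Stage 2: rank the candidates by authority, ignoring the query.
    ranked = sorted(candidates, key=lambda p: p["authority"], reverse=True)
    return [p["url"] for p in ranked]

results = search("deep learning", pages)
```

The most authoritative page overall (`b.example`) is dropped by the filter, and the remaining candidates are ordered purely by their authority score.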


Recommender Systems 


Recommender systems are another problem setting that is related to search and ranking. 
The problems are similar insofar as the goal is to display a set of items relevant to the user. 
The main difference is the emphasis on personalization to specific users in the context of 
recommender systems. For instance, for movie recommendations, the results page for a 
science fiction fan and the results page for a connoisseur of Peter Sellers comedies might 
differ significantly. Similar problems pop up in other recommendation settings, e.g., for 
retail products, music, and news recommendation. 


In some cases, customers provide explicit feedback, communicating how much they liked a 
particular product (e.g., the product ratings and reviews on Amazon, IMDb, or Goodreads). 
In other cases, they provide implicit feedback, e.g., by skipping titles on a playlist, which 
might indicate dissatisfaction or maybe just indicate that the song was inappropriate in 
context. In the simplest formulations, these systems are trained to estimate some score, 
such as an expected star rating or the probability that a given user will purchase a particular 
item. 


Given such a model, for any given user, we could retrieve the set of objects with the largest 
scores, which could then be recommended to the user. Production systems are consider- 
ably more advanced and take detailed user activity and item characteristics into account 
when computing such scores. Fig. 1.3.4 displays the deep learning books recommended by 
Amazon based on personalization algorithms tuned to capture Aston’s preferences. 
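
In its simplest form, retrieving the objects with the largest scores is a top-k selection over predicted scores. The sketch below assumes that a model has already produced made-up star-rating predictions for one user; the item names are placeholders.

```python
import heapq

def recommend(scores, k=2):
    """Return the k items with the largest predicted scores, best first."""
    return heapq.nlargest(k, scores, key=scores.get)

# Hypothetical predicted star ratings for one user.
predicted = {"Book A": 4.6, "Book B": 3.1, "Book C": 4.9, "Book D": 2.4}
top = recommend(predicted)
```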


Despite their tremendous economic value, recommender systems naively built on top of 
predictive models suffer some serious conceptual flaws. To start, we only observe censored 
feedback: users preferentially rate movies that they feel strongly about. For example, on 
a five-point scale, you might notice that items receive many one- and five-star ratings but 
that there are conspicuously few three-star ratings. Moreover, current purchase habits are 
often a result of the recommendation algorithm currently in place, but learning algorithms 
do not always take this detail into account. Thus it is possible for feedback loops to form 
where a recommender system preferentially pushes an item that is then taken to be better 
(due to greater purchases) and in turn is recommended even more frequently. Many of 
these problems (about how to deal with censoring, incentives, and feedback loops) are
important open research questions. 


Sequence Learning 


So far, we have looked at problems where we have some fixed number of inputs and produce 
a fixed number of outputs. For example, we considered predicting house prices given a 
fixed set of features: square footage, number of bedrooms, number of bathrooms, and the 
transit time to downtown. We also discussed mapping from an image (of fixed dimension) 
to the predicted probabilities that it belongs to each among a fixed number of classes and 
predicting star ratings associated with purchases based on the user ID and product ID alone. 
In these cases, once our model is trained, after each test example is fed into our model, it 
is immediately forgotten. We assumed that successive observations were independent and 
thus there was no need to hold on to this context. 


But how should we deal with video snippets? In this case, each snippet might consist of 
a different number of frames. And our guess of what is going on in each frame might be 
much stronger if we take into account the previous or succeeding frames. The same goes for 
language. For example, one popular deep learning problem is machine translation: the task 
of ingesting sentences in some source language and predicting their translations in another 
language. 


Such problems also occur in medicine. We might want a model to monitor patients in the 
intensive care unit and to fire off alerts whenever their risk of dying in the next 24 hours 
exceeds some threshold. Here, we would not throw away everything that we know about
the patient history every hour, because we might not want to make predictions based only
on the most recent measurements. 


Questions like these are among the most exciting applications of machine learning and 
they are instances of sequence learning. They require a model either to ingest sequences 
of inputs or to emit sequences of outputs (or both). Specifically, sequence-to-sequence 
learning considers problems where both inputs and outputs consist of variable-length se- 
quences. Examples include machine translation and speech-to-text transcription. While it 
is impossible to consider all types of sequence transformations, the following special cases 
are worth mentioning. 


Tagging and Parsing. This involves annotating a text sequence with attributes. Here, 
the inputs and outputs are aligned, i.e., they are of the same number and occur in a corresponding
order. For instance, in part-of-speech (PoS) tagging, we annotate every word in
a sentence with the corresponding part of speech, e.g., “noun” or “direct object”. Alternatively,
we might want to know which groups of contiguous words refer to named entities,
like people, places, or organizations. In the cartoonishly simple example below, we might
just want to indicate whether or not any word in the sentence is part of a named entity
(tagged as “Ent”).


Tom   has   dinner   in   Washington   with   Sally
Ent   -     -        -    Ent          -      Ent


Automatic Speech Recognition. With speech recognition, the input sequence is an audio 
recording of a speaker (Fig. 1.3.5), and the output is a transcript of what the speaker said. 
The challenge is that there are many more audio frames (sound is typically sampled at 
8 kHz or 16 kHz) than text, i.e., there is no 1:1 correspondence between audio and text,
since thousands of samples may correspond to a single spoken word. These are sequence- 
to-sequence learning problems, where the output is much shorter than the input. While 
humans are remarkably good at recognizing speech, even from low-quality audio, getting 
computers to perform the same feat is a formidable challenge. 


-D-e-e-p- L-ea-r-ni-ng- in an audio recording.


Text to Speech. This is the inverse of automatic speech recognition. Here, the input is text 
and the output is an audio file. In this case, the output is much longer than the input. 


Machine Translation. Unlike the case of speech recognition, where corresponding inputs 
and outputs occur in the same order, in machine translation, unaligned data poses a new 
challenge. Here the input and output sequences can have different lengths, and the corresponding
regions of the respective sequences may appear in a different order. Consider the
following illustrative example of the peculiar tendency of Germans to place the verbs at the 
end of sentences: 


German: Haben Sie sich schon dieses grossartige Lehrwerk angeschaut? 
English: Have you already looked at this excellent textbook? 
Wrong alignment: Have you yourself already this excellent textbook looked at? 


Many related problems pop up in other learning tasks. For instance, determining the order 
in which a user reads a webpage is a two-dimensional layout analysis problem. Dialogue 
problems exhibit all kinds of additional complications, where determining what to say next 
requires taking into account real-world knowledge and the prior state of the conversation 
across long temporal distances. Such topics are active areas of research. 


1.3.2 Unsupervised and Self-Supervised Learning 


The previous examples focused on supervised learning, where we feed the model a giant 
dataset containing both the features and corresponding label values. You could think of 
the supervised learner as having an extremely specialized job and an extremely dictatorial 
boss. The boss stands over the learner’s shoulder and tells them exactly what to do in every 
situation until they learn to map from situations to actions. Working for such a boss sounds 
pretty lame. On the other hand, pleasing such a boss is pretty easy. You just recognize the 
pattern as quickly as possible and imitate the boss’s actions. 


Considering the opposite situation, it could be frustrating to work for a boss who has no 
idea what they want you to do. However, if you plan to be a data scientist, you had better 
get used to it. The boss might just hand you a giant dump of data and tell you to do some 
data science with it! This sounds vague because it is vague. We call this class of problems 
unsupervised learning, and the type and number of questions we can ask is limited only by 
our creativity. We will address unsupervised learning techniques in later chapters. To whet 
your appetite for now, we describe a few questions you might ask.


• Can we find a small number of prototypes that accurately summarize the data? Given a
set of photos, can we group them into landscape photos, pictures of dogs, babies, cats, 
and mountain peaks? Likewise, given a collection of users’ browsing activities, can 
we group them into users with similar behavior? This problem is typically known as 
clustering. 


• Can we find a small number of parameters that accurately capture the relevant properties
of the data? The trajectories of a ball are well described by velocity, diameter, and 
mass of the ball. Tailors have developed a small number of parameters that describe 
human body shape fairly accurately for the purpose of fitting clothes. These problems 
are referred to as subspace estimation. If the dependence is linear, it is called principal 
component analysis. 


• Is there a representation of (arbitrarily structured) objects in Euclidean space such that
symbolic properties can be well matched? This can be used to describe entities and
their relations, such as “Rome” − “Italy” + “France” = “Paris”.


• Is there a description of the root causes of much of the data that we observe? For instance,
if we have demographic data about house prices, pollution, crime, location, education, 
and salaries, can we discover how they are related simply based on empirical data? 
The fields concerned with causality and probabilistic graphical models tackle such 
questions. 


• Another important and exciting recent development in unsupervised learning is the advent
of deep generative models. These models estimate the density of the data, either
explicitly or implicitly. Once trained, we can use a generative model either to score 
examples according to how likely they are, or to sample synthetic examples from the 
learned distribution. Early deep learning breakthroughs in generative modeling came 
with the invention of variational autoencoders (Kingma and Welling, 2014, Rezende 
et al., 2014) and continued with the development of generative adversarial networks 
(Goodfellow et al., 2014). More recent advances include normalizing flows (Dinh et 
al., 2014, Dinh et al., 2017) and diffusion models (Ho et al., 2020, Sohl-Dickstein et 
al., 2015, Song and Ermon, 2019, Song et al., 2021). 
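

To make the clustering question from the list above more concrete, here is a minimal sketch of one common approach, k-means (Lloyd's algorithm), on one-dimensional toy data. The choice of k-means, the function name `kmeans_1d`, the starting centers, and the data points are all assumptions made for illustration; the text does not commit to any particular clustering method.

```python
def kmeans_1d(points, centers, iters=20):
    """Lloyd's algorithm on 1-D data, starting from the given centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[k]
                   for k, c in enumerate(clusters)]
    return centers, clusters


# Made-up data with two obvious groups; the initial centers are arbitrary.
points = [1, 2, 3, 10, 11, 12]
centers, clusters = kmeans_1d(points, centers=[0, 5])
print(centers, clusters)
```

The two returned centers act as the "prototypes" that summarize the data. No labels were used at any point.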


A further development in unsupervised learning has been the rise of self-supervised learn- 
ing, techniques that leverage some aspect of the unlabeled data to provide supervision. For 
text, we can train models to “fill in the blanks” by predicting randomly masked words us- 
ing their surrounding words (contexts) in big corpora without any labeling effort (Devlin 
et al., 2018)! For images, we may train models to tell the relative position between two 
cropped regions of the same image (Doersch et al., 2015), to predict an occluded part of an 
image based on the remaining portions of the image, or to predict whether two examples 
are perturbed versions of the same underlying image. Self-supervised models often learn 
representations that are subsequently leveraged by fine-tuning the resulting models on some 
downstream task of interest. 
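

As a rough illustration of how “fill in the blanks” supervision can be manufactured from raw text, the following sketch masks random words and keeps the originals as prediction targets. The function name `make_masked_example`, the `[MASK]` placeholder string, and the masking probabilities are assumptions for illustration, not details taken from the cited work.

```python
import random

MASK = "[MASK]"  # placeholder token; the exact string is an assumption


def make_masked_example(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, targets); targets maps position -> original word."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # the label comes from the data itself
        else:
            masked.append(tok)
    return masked, targets


sentence = "the quick brown fox jumps over the lazy dog".split()
masked, targets = make_masked_example(sentence, mask_prob=0.3, seed=0)
print(masked)
print(targets)
```

A model trained on such pairs would receive `masked` as input and be asked to predict the words stored in `targets`, so no human labeling is involved.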


1.3.3 Interacting with an Environment 


So far, we have not discussed where data actually comes from, or what actually happens 
when a machine learning model generates an output. That is because supervised learning 
and unsupervised learning do not address these issues in a very sophisticated way. In each 
case, we grab a big pile of data upfront, then set our pattern recognition machines in motion 
without ever interacting with the environment again. Because all the learning takes place 
after the algorithm is disconnected from the environment, this is sometimes called offline 
learning. For example, supervised learning assumes the simple interaction pattern depicted 
in Fig. 1.3.6. 


This simplicity of offline learning has its charms. The upside is that we can worry about 
pattern recognition in isolation, with no concern about complications arising from interac- 
tions with a dynamic environment. But this problem formulation is limiting. If you grew 
up reading Asimov’s Robot novels, then you probably picture artificially intelligent agents 
capable not only of making predictions, but also of taking actions in the world. We want 
to think about intelligent agents, not just predictive models. This means that we need to 
think about choosing actions, not just making predictions. In contrast to mere predictions, 
actions actually impact the environment. If we want to train an intelligent agent, we must
account for the way its actions might impact the future observations of the agent, and so
offline learning is inappropriate.


Collecting data for supervised learning from an environment.


Considering the interaction with an environment opens a whole set of new modeling ques- 
tions. The following are just a few examples. 


• Does the environment remember what we did previously?


• Does the environment want to help us, e.g., a user reading text into a speech recognizer?


• Does the environment want to beat us, e.g., spammers adapting their emails to evade
spam filters? 


• Does the environment have shifting dynamics? For example, would future data always
resemble the past or would the patterns change over time, either naturally or in re- 
sponse to our automated tools? 


These questions raise the problem of distribution shift, where training and test data are
different. An example that many of us may have encountered is taking an exam written
by a lecturer after doing homework composed by their teaching assistants. Next, we
briefly describe reinforcement learning, a rich framework for posing learning problems in 
which an agent interacts with an environment. 


1.3.4 Reinforcement Learning 


If you are interested in using machine learning to develop an agent that interacts with an 
environment and takes actions, then you are probably going to wind up focusing on re- 
inforcement learning. This might include applications to robotics, to dialogue systems, 
and even to developing artificial intelligence (AI) for video games. Deep reinforcement 
learning, which applies deep learning to reinforcement learning problems, has surged in 
popularity. The breakthrough deep Q-network, which beat humans at Atari games using only
the visual input (Mnih et al., 2015), and the AlphaGo program, which dethroned the world 
champion at the board game Go (Silver et al., 2016), are two prominent examples. 


Reinforcement learning gives a very general statement of a problem in which an agent inter- 
acts with an environment over a series of time steps. At each time step, the agent receives 
some observation from the environment and must choose an action that is subsequently 
transmitted back to the environment via some mechanism (sometimes called an actuator).
After each loop, the agent receives a reward from the environment. This process is
illustrated in Fig. 1.3.7. The agent then receives a subsequent observation, and chooses a
subsequent action, and so on. The behavior of a reinforcement learning agent is governed 
by a policy. In brief, a policy is just a function that maps from observations of the environ- 
ment to actions. The goal of reinforcement learning is to produce good policies. 


The interaction between reinforcement learning and an environment.


It is hard to overstate the generality of the reinforcement learning framework. For example, 
supervised learning can be recast as reinforcement learning. Say we had a classification 
problem. We could create a reinforcement learning agent with one action corresponding 
to each class. We could then create an environment which gave a reward that was exactly
equal to the negative of the loss function from the original supervised learning problem,
so that maximizing the reward amounts to minimizing the loss.
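

The recasting described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class `ClassificationEnv`, the function `threshold_policy`, and the toy data are hypothetical, and the reward is taken to be the negative 0-1 loss so that a higher reward corresponds to a lower loss.

```python
import random


class ClassificationEnv:
    """Toy environment wrapping a labeled dataset (hypothetical helper)."""

    def __init__(self, examples, seed=0):
        self.examples = examples  # list of (features, label) pairs
        self.rng = random.Random(seed)
        self.current = None

    def observe(self):
        # The observation is just the features of a randomly drawn example.
        self.current = self.rng.choice(self.examples)
        return self.current[0]

    def step(self, action):
        # Reward is the negative 0-1 loss: 0 if correct, -1 if wrong.
        label = self.current[1]
        return 0.0 if action == label else -1.0


def threshold_policy(observation):
    """A fixed policy: maps an observation to an action (a class index)."""
    return 1 if observation >= 0.5 else 0


# Made-up one-dimensional examples; the last one is misclassified by the policy.
data = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.45, 1)]
env = ClassificationEnv(data)
total_reward = 0.0
for _ in range(100):
    obs = env.observe()
    total_reward += env.step(threshold_policy(obs))
print("average reward:", total_reward / 100)
```

Note that the environment only reports a reward for the chosen action. It never reveals the correct label, which is what makes the reinforcement learning view more general than the supervised one.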


Further, reinforcement learning can also address many problems that supervised learning 
cannot. For example, in supervised learning, we always expect that the training input comes 
associated with the correct label. But in reinforcement learning, we do not assume that,
for each observation, the environment tells us the optimal action. In general, we just get
some reward. Moreover, the environment may not even tell us which actions led to the 
reward. 


Consider the game of chess. The only real reward signal comes at the end of the game when 
we either win, earning a reward of, say, 1, or when we lose, receiving a reward of, say, 
−1. So reinforcement learners must deal with the credit assignment problem: determining
which actions to credit or blame for an outcome. The same goes for an employee who gets 
a promotion on October 11. That promotion likely reflects a number of well-chosen actions 
over the previous year. Getting promoted in the future requires figuring out which actions 
along the way led to the earlier promotions. 


Reinforcement learners may also have to deal with the problem of partial observability. 
That is, the current observation might not tell you everything about your current state. Say 
your cleaning robot found itself trapped in one of many identical closets in your house. 
Rescuing the robot involves inferring its precise location, which might require considering
earlier observations prior to it entering the closet. 


Finally, at any given point, reinforcement learners might know of one good policy, but 
there might be many other better policies that the agent has never tried. The reinforcement 
learner must constantly choose whether to exploit the best (currently) known strategy as a 
policy, or to explore the space of strategies, potentially giving up some short-term reward 
in exchange for knowledge. 


The general reinforcement learning problem is posed in a very broad setting. Actions affect
subsequent observations. Rewards are only observed when they correspond to the chosen
actions. The environment may be either fully or partially observed. Accounting for all this
complexity at once may be asking too much. Moreover, not every practical problem ex- 
hibits all this complexity. As a result, researchers have studied a number of special cases 
of reinforcement learning problems. 


When the environment is fully observed, we call the reinforcement learning problem a 
Markov decision process. When the state does not depend on the previous actions, we call 
it a contextual bandit problem. When there is no state, just a set of available actions with 
initially unknown rewards, we have the classic multi-armed bandit problem. 
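

To give a feel for the simplest of these special cases, the sketch below runs an epsilon-greedy strategy on a multi-armed bandit, which also illustrates the exploration versus exploitation trade-off discussed earlier. Epsilon-greedy is just one simple strategy chosen here for illustration; the function name, the Bernoulli reward model, and the arm probabilities are assumptions.

```python
import random


def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit; return (estimates, counts)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms       # how often each arm was pulled
    estimates = [0.0] * n_arms  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        # The true means are hidden from the agent; it only sees the reward.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print("estimated mean rewards:", estimates)
print("pull counts:", counts)
```

With `epsilon=0`, the agent would only ever exploit and could lock onto a poor arm. A small positive `epsilon` gives up some short-term reward in exchange for knowledge about the other arms.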


1.4 Roots 


We have just reviewed a small subset of problems that machine learning can address. For 
a diverse set of machine learning problems, deep learning provides powerful tools for their 
solution. Although many deep learning methods are recent inventions, the core ideas be- 
hind learning from data have been studied for centuries. In fact, humans have held the 
desire to analyze data and to predict future outcomes for ages, and it is this desire that is 
at the root of much of natural science and mathematics. Two examples are the Bernoulli 
distribution, named after Jacob Bernoulli (1655-1705), and the Gaussian distribution
discovered by Carl Friedrich Gauss (1777-1855). Gauss invented, for instance, the least
mean squares algorithm, which is still used today for a multitude of problems from insurance
calculations to medical diagnostics. Such tools enhanced the experimental approach
in the natural sciences; for instance, Ohm’s law relating current and voltage in a resistor
is perfectly described by a linear model.


Even in the Middle Ages, mathematicians had a keen intuition of estimates. For instance,
the geometry book of Jacob Köbel (1460-1533) illustrates averaging the length of 16
adult men’s feet to estimate the typical foot length in the population (Fig. 1.4.1).


As a group of individuals exited a church, 16 adult men were asked to line up in a row 
and have their feet measured. The sum of these measurements was then divided by 16 to 
obtain an estimate for what now is called one foot. This “algorithm” was later improved to 
deal with misshapen feet: the two men with the shortest and longest feet were sent away,
averaging only over the remainder. This is among the earliest examples of a trimmed mean 
estimate. 
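

The procedure just described is easy to write down. The following is a minimal sketch of the plain average and of the trimmed variant that drops the shortest and longest measurement; the function name `trimmed_mean` and the foot lengths are made up for illustration and are not historical data.

```python
def trimmed_mean(values, trim=1):
    """Average after dropping the `trim` smallest and `trim` largest values."""
    if len(values) <= 2 * trim:
        raise ValueError("not enough values to trim")
    kept = sorted(values)[trim:len(values) - trim]
    return sum(kept) / len(kept)


# Hypothetical foot lengths in centimeters for 16 men (made-up numbers,
# with one unusually long foot at the end).
feet_cm = [24.5, 25.0, 25.5, 26.0, 26.0, 26.5, 26.5, 27.0,
           27.0, 27.5, 27.5, 28.0, 28.5, 29.0, 29.5, 35.0]
plain = sum(feet_cm) / len(feet_cm)
trimmed = trimmed_mean(feet_cm, trim=1)
print(f"plain mean: {plain:.2f} cm, trimmed mean: {trimmed:.2f} cm")
```

Dropping the extremes makes the estimate less sensitive to a single unusual measurement, which is the point of the improved “algorithm”.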


Estimating the length of a foot.


Statistics really took off with the availability and collection of data. One of its pioneers,
Ronald Fisher (1890-1962), contributed significantly to its theory and also its applications
in genetics. Many of his algorithms (such as linear discriminant analysis) and concepts
(such as the Fisher information matrix) still hold a prominent place in the foundations
of modern statistics. Even his data resources had a lasting impact. The Iris dataset
that Fisher released in 1936 is still sometimes used to demonstrate machine learning algorithms.
Fisher was also a proponent of eugenics, which should remind us that the morally
dubious use of data science has as long and enduring a history as its productive use in
industry and the natural sciences.


Other influences for machine learning came from the information theory of Claude Shannon
(1916-2001) and the theory of computation proposed by Alan Turing (1912-1954).
Turing posed the question “can machines think?” in his famous paper Computing Machinery
and Intelligence (Turing, 1950). Describing what is now known as the Turing test,
he proposed that a machine can be considered intelligent if it is difficult for a human evalu- 
ator to distinguish between the replies from a machine and those of a human, based purely 
on textual interactions. 


Further influences came from neuroscience and psychology. After all, humans clearly exhibit
intelligent behavior. Many scholars have asked whether one could explain and possibly
reverse engineer this capacity. One of the first biologically inspired algorithms was
formulated by Donald Hebb (1904-1985). In his groundbreaking book The Organization
of Behavior (Hebb, 1949), he posited that neurons learn by positive reinforcement.
This became known as the Hebbian learning rule. These ideas inspired later work, such
as Rosenblatt’s perceptron learning algorithm, and laid the foundations of many stochastic
gradient descent algorithms that underpin deep learning today: reinforce desirable behavior
and diminish undesirable behavior to obtain good settings of the parameters in a neural
network.


Biological inspiration is what gave neural networks their name. For over a century (dating
back to the models of Alexander Bain, 1873, and James Sherrington, 1890), researchers 
have tried to assemble computational circuits that resemble networks of interacting neurons. 
Over time, the interpretation of biology has become less literal, but the name stuck. At its 
heart lie a few key principles that can be found in most networks today: 


• The alternation of linear and nonlinear processing units, often referred to as layers.


• The use of the chain rule (also known as backpropagation) for adjusting parameters in
the entire network at once. 


After initial rapid progress, research in neural networks languished from around 1995 until 
2005. This was mainly due to two reasons. First, training a network is computationally 
very expensive. While random-access memory was plentiful at the end of the past century, 
computational power was scarce. Second, datasets were relatively small. In fact, Fisher’s 
Iris dataset from 1936 was still a popular tool for testing the efficacy of algorithms. The 
MNIST dataset with its 60,000 handwritten digits was considered huge. 


Given the scarcity of data and computation, strong statistical tools such as kernel methods, 
decision trees, and graphical models proved empirically superior in many applications. 
Moreover, unlike neural networks, they did not require weeks to train and provided pre- 
dictable results with strong theoretical guarantees. 


1.5 The Road to Deep Learning 


Much of this changed with the availability of massive amounts of data, thanks to the World 
Wide Web, the advent of companies serving hundreds of millions of users online, a dis- 
semination of low-cost, high-quality sensors, inexpensive data storage (Kryder’s law), and 
cheap computation (Moore’s law). In particular, the landscape of computation in deep 
learning was revolutionized by advances in GPUs that were originally engineered for com- 
puter gaming. Suddenly algorithms and models that seemed computationally infeasible 
were within reach. This is best illustrated in Table 1.5.1.


Table 1.5.1: Dataset vs. computer memory and computational power

Decade | Dataset                              | Memory | Floating point calculations per second
1970   | 100 (Iris)                           | 1 KB   | 100 KF (Intel 8080)
1980   | 1 K (house prices in Boston)         | 100 KB | 1 MF (Intel 80186)
1990   | 10 K (optical character recognition) | 10 MB  | 10 MF (Intel 80486)
2000   | 10 M (web pages)                     | 100 MB | 1 GF (Intel Core)
2010   | 10 G (advertising)                   | 1 GB   | 1 TF (NVIDIA C2050)
2020   | 1 T (social network)                 | 100 GB | 1 PF (NVIDIA DGX-2)


Note that random-access memory has not kept pace with the growth in data. At the same 
time, increases in computational power have outpaced the growth in datasets. This means 
that statistical models need to become more memory efficient, while they are free to spend
more computer cycles optimizing parameters, thanks to the increased compute budget.
Consequently, the sweet spot in machine learning and statistics moved from (generalized) 
linear models and kernel methods to deep neural networks. This is also one of the rea- 
sons why many of the mainstays of deep learning, such as multilayer perceptrons (McCul- 
loch and Pitts, 1943), convolutional neural networks (LeCun et al., 1998), long short-term 
memory (Hochreiter and Schmidhuber, 1997), and Q-Learning (Watkins and Dayan, 1992), 
were essentially “rediscovered” in the past decade, after lying comparatively dormant for 
considerable time. 


The recent progress in statistical models, applications, and algorithms has sometimes been 
likened to the Cambrian explosion: a moment of rapid progress in the evolution of species. 
Indeed, the state of the art is not just a mere consequence of available resources applied 
to decades-old algorithms. Note that the list of ideas below barely scratches the surface of 
what has helped researchers achieve tremendous progress over the past decade. 


• Novel methods for capacity control, such as dropout (Srivastava et al., 2014), have helped
to mitigate overfitting. Here, noise is injected (Bishop, 1995) throughout the neural 
network during training. 


• Attention mechanisms solved a second problem that had plagued statistics for over a
century: how to increase the memory and complexity of a system without increasing 
the number of learnable parameters. Researchers found an elegant solution by using 
what can only be viewed as a learnable pointer structure (Bahdanau et al., 2014). 
Rather than having to remember an entire text sequence, e.g., for machine translation,
in a fixed-dimensional representation, all that needed to be stored was a pointer to the
intermediate state of the translation process. This allowed for significantly increased 
accuracy for long sequences, since the model no longer needed to remember the entire 
sequence before commencing the generation of a new one. 


• Built solely on attention mechanisms, the Transformer architecture (Vaswani et al., 2017)
has demonstrated superior scaling behavior: it performs better with an increase in
dataset size, model size, and amount of training compute (Kaplan et al., 2020). This 
architecture has demonstrated compelling success in a wide range of areas, such as 
natural language processing (Brown et al., 2020, Devlin et al., 2018), computer vision 
(Dosovitskiy et al., 2021, Liu et al., 2021), speech recognition (Gulati et al., 2020), 
reinforcement learning (Chen et al., 2021), and graph neural networks (Dwivedi and 
Bresson, 2020). For example, a single Transformer pretrained on modalities as diverse 
as text, images, joint torques, and button presses can play Atari, caption images, chat, 
and control a robot (Reed et al., 2022). 


• Modeling probabilities of text sequences, language models can predict text given other
text. Scaling up the data, model, and compute has unlocked a growing number of 
capabilities of language models to perform desired tasks via human-like text genera- 
tion based on input text (Anil et al., 2023, Brown et al., 2020, Chowdhery et al., 2022, 
Hoffmann et al., 2022, OpenAI, 2023, Rae et al., 2021, Touvron et al., 2023a, Touvron 
et al., 2023b). For instance, by aligning language models with human intent (Ouyang et
al., 2022), OpenAI’s ChatGPT allows users to interact with it in a conversational
way to solve problems, such as code debugging and creative writing. 


• Multi-stage designs, e.g., via the memory networks (Sukhbaatar et al., 2015) and the neural
programmer-interpreter (Reed and De Freitas, 2015), permitted statistical modelers
to describe iterative approaches to reasoning. These tools allow for an internal state of 
the deep neural network to be modified repeatedly, thus carrying out subsequent steps 
in a chain of reasoning, just as a processor can modify memory for a computation. 


• A key development in deep generative modeling was the invention of generative adversarial
networks (Goodfellow et al., 2014). Traditionally, statistical methods for density
estimation and generative models focused on finding proper probability distributions 
and (often approximate) algorithms for sampling from them. As a result, these algo- 
rithms were largely limited by the lack of flexibility inherent in the statistical models. 
The crucial innovation in generative adversarial networks was to replace the sampler 
by an arbitrary algorithm with differentiable parameters. These are then adjusted in 
such a way that the discriminator (effectively a two-sample test) cannot distinguish 
fake from real data. Through the ability to use arbitrary algorithms to generate data, 
density estimation was opened up to a wide variety of techniques. Examples of gal- 
loping zebras (Zhu et al., 2017) and of fake celebrity faces (Karras et al., 2017) are 
each testimony to this progress. Even amateur doodlers can produce photorealistic 
images just based on sketches describing the layout of a scene (Park et al., 2019). 


• Furthermore, while the diffusion process gradually adds random noise to data samples,
diffusion models (Ho et al., 2020, Sohl-Dickstein et al., 2015) learn the denoising pro- 
cess to gradually construct data samples from random noise, reversing the diffusion 
process. They have started to replace generative adversarial networks in more recent 
deep generative models, such as in DALL-E 2 (Ramesh et al., 2022) and Imagen (Sa- 
haria et al., 2022) for creative art and image generation based on text descriptions. 


• In many cases, a single GPU is insufficient for processing the large amounts of data
available for training. Over the past decade the ability to build parallel and distributed
training algorithms has improved significantly. One of the key challenges in designing 
scalable algorithms is that the workhorse of deep learning optimization, stochastic 
gradient descent, relies on relatively small minibatches of data to be processed. At 
the same time, small batches limit the efficiency of GPUs. Hence, training on 1,024 
GPUs with a minibatch size of, say, 32 images per batch amounts to an aggregate 
minibatch of about 32,000 images. Work, first by Li (2017) and subsequently by You
et al. (2017) and Jia et al. (2018), pushed the size up to 64,000 observations, reducing
training time for the ResNet-50 model on the ImageNet dataset to less than 7 minutes.
By comparison, training times were initially of the order of days.


• The ability to parallelize computation has also contributed to progress in reinforcement
learning. This has led to significant progress in computers achieving superhuman
performance on tasks like Go, Atari games, Starcraft, and in physics simulations (e.g.,
using MuJoCo) where environment simulators are available. See, e.g., Silver et al.
(2016) for a description of such achievements in AlphaGo. In a nutshell, reinforcement
learning works best if plenty of (state, action, reward) tuples are available. Simulation
provides such an avenue.


• Deep learning frameworks have played a crucial role in disseminating ideas. The first
generation of open-source frameworks for neural network modeling consisted of Caffe,
Torch, and Theano. Many seminal papers were written using these tools.
These have now been superseded by TensorFlow (often used via its high-level API
Keras), CNTK, Caffe 2, and Apache MXNet. The third generation of
frameworks consists of so-called imperative tools for deep learning, a trend that was
arguably ignited by Chainer, which used a syntax similar to Python NumPy to
describe models. This idea was adopted by PyTorch, the Gluon API of
MXNet, and JAX.


The division of labor between system researchers building better tools and statistical
modelers building better neural networks has greatly simplified things. For instance, training a
linear logistic regression model used to be a nontrivial homework problem, worthy of being
given to new machine learning Ph.D. students at Carnegie Mellon University in 2014. By now,
this task can be accomplished with under 10 lines of code, putting it firmly within the reach
of any programmer.
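

To make the claim about brevity concrete, here is a minimal sketch of logistic regression in PyTorch. The synthetic dataset, the learning rate, and the number of steps are illustrative assumptions, not values taken from the text.

```python
import torch

torch.manual_seed(0)
# Synthetic, linearly separable data (an assumption made for illustration)
X = torch.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).float().unsqueeze(1)

model = torch.nn.Linear(2, 1)              # linear model producing logits
loss_fn = torch.nn.BCEWithLogitsLoss()     # sigmoid followed by binary cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
for step in range(200):                    # full-batch gradient descent
    optimizer.zero_grad()
    loss_fn(model(X), y).backward()
    optimizer.step()

# Fraction of training points classified correctly (logit > 0 means class 1)
accuracy = ((model(X) > 0).float() == y).float().mean().item()
```

The model definition, loss, optimizer, and training loop together take fewer than ten lines; the framework supplies the gradients automatically.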


1.6 Success Stories


Artificial intelligence has a long history of delivering results that would be difficult to ac- 
complish otherwise. For instance, mail sorting systems using optical character recognition 
have been deployed since the 1990s. This is, after all, the source of the famous MNIST 
dataset of handwritten digits. The same applies to reading checks for bank deposits and
scoring creditworthiness of applicants. Financial transactions are checked for fraud
automatically. This forms the backbone of many e-commerce payment systems, such as PayPal,
Stripe, AliPay, WeChat, Apple, Visa, and MasterCard. Computer programs for chess have 
been competitive for decades. Machine learning feeds search, recommendation, personal- 
ization, and ranking on the Internet. In other words, machine learning is pervasive, albeit 
often hidden from sight. 


It is only recently that AI has been in the limelight, mostly due to solutions to problems that
were considered intractable previously and that are directly related to consumers. Many of
these advances are attributed to deep learning.


• Intelligent assistants, such as Apple’s Siri, Amazon’s Alexa, and Google’s assistant, are
able to respond to spoken requests with a reasonable degree of accuracy. This in- 
cludes menial jobs, like turning on light switches, and more complex tasks, such as 
arranging barber’s appointments and offering phone support dialog. This is likely the 
most noticeable sign that AI is affecting our lives. 


• A key ingredient in digital assistants is their ability to recognize speech accurately. The
accuracy of such systems has gradually increased to the point of achieving parity with 
humans for certain applications (Xiong et al., 2018). 


• Object recognition has likewise come a long way. Identifying the object in a picture was
a fairly challenging task in 2010. On the ImageNet benchmark, researchers from NEC
Labs and University of Illinois at Urbana-Champaign achieved a top-five error rate 
of 28% (Lin et al., 2010). By 2017, this error rate was reduced to 2.25% (Hu et al., 
2018). Similarly, stunning results have been achieved for identifying birdsong and for 
diagnosing skin cancer. 


• Prowess in games used to provide a measuring stick for human ability. Starting from
TD-Gammon, a program for playing backgammon using temporal difference rein- 
forcement learning, algorithmic and computational progress has led to algorithms for 
a wide range of applications. Compared with backgammon, chess has a much more 
complex state space and set of actions. DeepBlue beat Garry Kasparov using mas- 
sive parallelism, special-purpose hardware and efficient search through the game tree 
(Campbell et al., 2002). Go is more difficult still, due to its huge state space. AlphaGo 
reached human parity in 2015, using deep learning combined with Monte Carlo tree
search (Silver et al., 2016). The challenge in Poker was that the state space is large
and only partially observed (we do not know the opponents’ cards). Libratus exceeded 
human performance in Poker using efficiently structured strategies (Brown and Sand- 
holm, 2017). 


• Another indication of progress in AI is the advent of self-driving vehicles. While full
autonomy is not yet within reach, excellent progress has been made in this direction, 
with companies such as Tesla, NVIDIA, and Waymo shipping products that enable 
partial autonomy. What makes full autonomy so challenging is that proper driving 
requires the ability to perceive, to reason and to incorporate rules into a system. At 
present, deep learning is used primarily in the visual aspect of these problems. The 
rest is heavily tuned by engineers. 


This barely scratches the surface of significant applications of machine learning. For
instance, robotics, logistics, computational biology, particle physics, and astronomy owe
some of their most impressive recent advances at least in part to machine learning, which
is thus becoming a ubiquitous tool for engineers and scientists.


Frequently, questions about a coming AI apocalypse and the plausibility of a singularity 
have been raised in non-technical articles. The fear is that somehow machine learning 
systems will become sentient and make decisions, independently of their programmers, 
that directly impact the lives of humans. To some extent, AI already affects the livelihood 
of humans in direct ways: creditworthiness is assessed automatically, autopilots mostly 
navigate vehicles, decisions about whether to grant bail use statistical data as input. More 
frivolously, we can ask Alexa to switch on the coffee machine. 


Fortunately, we are far from a sentient AI system that could deliberately manipulate its 
human creators. First, AI systems are engineered, trained, and deployed in a specific, goal- 
oriented manner. While their behavior might give the illusion of general intelligence, it is a 
combination of rules, heuristics and statistical models that underlie the design. Second, at 
present, there are simply no tools for artificial general intelligence that are able to improve 
themselves, reason about themselves, and that are able to modify, extend, and improve their 
own architecture while trying to solve general tasks. 


A much more pressing concern is how AI is being used in our daily lives. It is likely that 
many routine tasks, currently fulfilled by humans, can and will be automated. Farm robots 
will likely reduce the costs for organic farmers but they will also automate harvesting op- 
erations. This phase of the industrial revolution may have profound consequences for large 
swaths of society, since menial jobs provide much employment in many countries. Fur- 
thermore, statistical models, when applied without care, can lead to racial, gender, or age 
bias and raise reasonable concerns about procedural fairness if automated to drive conse- 
quential decisions. It is important to ensure that these algorithms are used with care. With 
what we know today, this strikes us as a much more pressing concern than the potential of 
malevolent superintelligence for destroying humanity. 


1.7 The Essence of Deep Learning 


Thus far, we have talked in broad terms about machine learning. Deep learning is the subset 
of machine learning concerned with models based on many-layered neural networks. It is 
deep in precisely the sense that its models learn many layers of transformations. While this 
might sound narrow, deep learning has given rise to a dizzying array of models, techniques, 
problem formulations, and applications. Many intuitions have been developed to explain 
the benefits of depth. Arguably, all machine learning has many layers of computation, the 
first consisting of feature processing steps. What differentiates deep learning is that the 
operations learned at each of the many layers of representations are learned jointly from 
data. 


The problems that we have discussed so far, such as learning from the raw audio signal, the
raw pixel values of images, or mapping between sentences of arbitrary lengths and their 
counterparts in foreign languages, are those where deep learning excels and traditional 
methods falter. It turns out that these many-layered models are capable of addressing low- 
level perceptual data in a way that previous tools could not. Arguably the most significant 
commonality in deep learning methods is end-to-end training. That is, rather than
assembling a system based on components that are individually tuned, one builds the system and
then tunes its performance jointly. For instance, in computer vision, scientists used to
separate the process of feature engineering from the process of building machine
learning models. The Canny edge detector (Canny, 1987) and Lowe’s SIFT feature extractor
(Lowe, 2004) reigned supreme for over a decade as algorithms for mapping images into 
feature vectors. In bygone days, the crucial part of applying machine learning to these 
problems consisted of coming up with manually-engineered ways of transforming the data 
into some form amenable to shallow models. Unfortunately, there is only so much that 
humans can accomplish by ingenuity in comparison with a consistent evaluation over mil- 
lions of choices carried out automatically by an algorithm. When deep learning took over, 
these feature extractors were replaced by automatically tuned filters that yielded superior 
accuracy. 
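

The contrast between a hand-engineered filter and an automatically tuned one can be sketched in a few lines. This is only an illustration under stated assumptions: the "images" are random tensors, the Sobel kernel stands in for a hand-designed feature extractor, and, so that the result can be checked, the learnable filter is fitted to reproduce the Sobel responses. In genuine end-to-end training, the filter would instead be tuned by the loss of the downstream task.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hand-engineered feature extractor: a fixed Sobel kernel for vertical edges.
sobel = torch.tensor([[-1., 0., 1.],
                      [-2., 0., 2.],
                      [-1., 0., 1.]]).reshape(1, 1, 3, 3)
images = torch.randn(64, 1, 8, 8)      # toy inputs (random, for illustration only)
targets = F.conv2d(images, sobel)      # responses of the hand-designed filter

# Learned alternative: a 3x3 filter whose weights are tuned from data.
conv = torch.nn.Conv2d(1, 1, kernel_size=3, bias=False)
optimizer = torch.optim.SGD(conv.parameters(), lr=0.05)
for step in range(500):
    optimizer.zero_grad()
    loss = F.mse_loss(conv(images), targets)
    loss.backward()
    optimizer.step()

learned = conv.weight.detach()         # filter recovered by gradient descent
```

Nothing in the training loop encodes what an edge is; the filter weights are found purely by minimizing a loss on data.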


Thus, one key advantage of deep learning is that it replaces not only the shallow models at 
the end of traditional learning pipelines, but also the labor-intensive process of feature engi- 
neering. Moreover, by replacing much of the domain-specific preprocessing, deep learning 
has eliminated many of the boundaries that previously separated computer vision, speech 
recognition, natural language processing, medical informatics, and other application areas, 
thereby offering a unified set of tools for tackling diverse problems. 


Beyond end-to-end training, we are experiencing a transition from parametric statistical 
descriptions to fully nonparametric models. When data is scarce, one needs to rely on sim- 
plifying assumptions about reality in order to obtain useful models. When data is abundant, 
these can be replaced by nonparametric models that better fit the data. To some extent, this 
mirrors the progress that physics experienced in the middle of the previous century with 
the availability of computers. Rather than solving by hand parametric approximations of 
how electrons behave, one can now resort to numerical simulations of the associated par- 
tial differential equations. This has led to much more accurate models, albeit often at the 
expense of interpretation. 


Another difference from previous work is the acceptance of suboptimal solutions, dealing 
with nonconvex nonlinear optimization problems, and the willingness to try things before 
proving them. This new-found empiricism in dealing with statistical problems, combined
with a rapid influx of talent, has led to rapid progress in the development of practical
algorithms, albeit in many cases at the expense of modifying and re-inventing tools that existed
for decades. 


In the end, the deep learning community prides itself on sharing tools across academic and 
corporate boundaries, releasing many excellent libraries, statistical models, and trained 
networks as open source. It is in this spirit that the notebooks forming this book are freely 
available for distribution and use. We have worked hard to lower the barriers of access for 
anyone wishing to learn about deep learning and we hope that our readers will benefit from
this. 


1.8 Summary 


Machine learning studies how computer systems can leverage experience (often data) to 
improve performance at specific tasks. It combines ideas from statistics, data mining, and 
optimization. Often, it is used as a means of implementing AI solutions. As a class of 
machine learning, representation learning focuses on how to automatically find the
appropriate way to represent data. Considered as multi-level representation learning through
learning many layers of transformations, deep learning replaces not only the shallow mod- 
els at the end of traditional machine learning pipelines, but also the labor-intensive process 
of feature engineering. Much of the recent progress in deep learning has been triggered 
by an abundance of data arising from cheap sensors and Internet-scale applications, and 
by significant progress in computation, mostly through GPUs. Furthermore, the availabil- 
ity of efficient deep learning frameworks has made design and implementation of whole 
system optimization significantly easier, and this is a key component in obtaining high 
performance. 


1.9 Exercises 


1. Which parts of code that you are currently writing could be “learned”, i.e., improved 
by learning and automatically determining design choices that are made in your code? 
Does your code include heuristic design choices? What data might you need to learn 
the desired behavior? 


2. Which problems that you encounter have many examples for their solution, yet no spe- 
cific way for automating them? These may be prime candidates for using deep learning. 


3. Describe the relationships between algorithms, data, and computation. How do char- 
acteristics of the data and the current available computational resources influence the 
appropriateness of various algorithms? 


4. Name some settings where end-to-end training is not currently the default approach but 
where it might be useful. 


Discussions.


To prepare for your dive into deep learning, you will need a few survival skills: (i)
techniques for storing and manipulating data; (ii) libraries for ingesting and preprocessing data
from a variety of sources; (iii) knowledge of the basic linear algebraic operations that we 
apply to high-dimensional data elements; (iv) just enough calculus to determine which di- 
rection to adjust each parameter in order to decrease the loss function; (v) the ability to 
automatically compute derivatives so that you can forget much of the calculus you just 
learned; (vi) some basic fluency in probability, our primary language for reasoning under 
uncertainty; and (vii) some aptitude for finding answers in the official documentation when 
you get stuck. 


In short, this chapter provides a rapid introduction to the basics that you will need to follow 
most of the technical content in this book. 


2.1 Data Manipulation 


In order to get anything done, we need some way to store and manipulate data. Generally, 
there are two important things we need to do with data: (i) acquire them; and (ii) process 
them once they are inside the computer. There is no point in acquiring data without some 
way to store it, so to start, let’s get our hands dirty with n-dimensional arrays, which we 
also call tensors. If you already know the NumPy scientific computing package, this will be 
a breeze. For all modern deep learning frameworks, the tensor class (ndarray in MXNet, 
Tensor in PyTorch and TensorFlow) resembles NumPy’s ndarray, with a few killer fea- 
tures added. First, the tensor class supports automatic differentiation. Second, it leverages 
GPUs to accelerate numerical computation, whereas NumPy only runs on CPUs. These 
properties make neural networks both easy to code and fast to run. 


2.1.1 Getting Started 


To start, we import the PyTorch library. Note that the package name is torch. 


import torch 


A tensor represents a (possibly multidimensional) array of numerical values. In the one- 
dimensional case, i.e., when only one axis is needed for the data, a tensor is called a vector. 
With two axes, a tensor is called a matrix. With k > 2 axes, we drop the specialized names
and just refer to the object as a kth-order tensor.
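

As a small illustration of this terminology, the number of axes of a tensor can be read from its `ndim` attribute. The variable names below are chosen only for this sketch.

```python
import torch

v = torch.arange(4)          # one axis: a vector
M = torch.zeros(2, 3)        # two axes: a matrix
T = torch.zeros(2, 3, 4)     # three axes: a 3rd-order tensor

# Number of axes of each tensor
num_axes = (v.ndim, M.ndim, T.ndim)
```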


PyTorch provides a variety of functions for creating new tensors prepopulated with values. 
For example, by invoking arange(n), we can create a vector of evenly spaced values, start- 
ing at 0 (included) and ending at n (not included). By default, the interval size is 1. Unless 
otherwise specified, new tensors are stored in main memory and designated for CPU-based 
computation. 


x = torch.arange(12, dtype=torch.float32)
x


tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.])
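

The start value and the interval size mentioned above can also be set explicitly. The following sketch uses the three-argument form `torch.arange(start, end, step)`; the particular values are arbitrary.

```python
import torch

# Start at 2 (included), stop before 10, step by 2
evens = torch.arange(2, 10, 2)

# Non-integer steps work as well and produce floating-point values
quarters = torch.arange(0, 1, 0.25)
```

As with the one-argument form, the end value is never included in the result.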


Each of these values is called an element of the tensor. The tensor x contains 12 elements. 
We can inspect the total number of elements in a tensor via its numel method. 


x.numel()


12 


We can access a tensor’s shape (the length along each axis) by inspecting its shape attribute. 
Because we are dealing with a vector here, the shape contains just a single element and is 
identical to the size. 


x.shape


torch.Size([12])


We can change the shape of a tensor without altering its size or values, by invoking reshape. 
For example, we can transform our vector x whose shape is (12,) to a matrix X with shape 
(3, 4). This new tensor retains all elements but reconfigures them into a matrix. Notice that 
the elements of our vector are laid out one row at a time and thus x[3] == X[0, 3].


X = x.reshape(3, 4) 
X 


tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])


Note that specifying every shape component to reshape is redundant. Because we already 
know our tensor’s size, we can work out one component of the shape given the rest. For 
example, given a tensor of size n and target shape (h, w), we know that w = n/h. To 
automatically infer one component of the shape, we can place a -1 for the shape component
that should be inferred automatically. In our case, instead of calling x.reshape(3, 4), we
could have equivalently called x.reshape(-1, 4) or x.reshape(3, -1).
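

A quick check of this equivalence, as a minimal sketch (the names A, B, and C are introduced here only for illustration):

```python
import torch

x = torch.arange(12, dtype=torch.float32)
A = x.reshape(3, 4)
B = x.reshape(-1, 4)   # number of rows inferred: 12 / 4 = 3
C = x.reshape(3, -1)   # number of columns inferred: 12 / 3 = 4

all_equal = torch.equal(A, B) and torch.equal(A, C)
```

Only one component can be left for inference; the remaining components must divide the total number of elements.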


Practitioners often need to work with tensors initialized to contain all 0s or 1s. We can
construct a tensor with all elements set to 0 and a shape of (2, 3, 4) via the zeros
function.


torch.zeros((2, 3, 4)) 


tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]])


Similarly, we can create a tensor with all 1s by invoking ones. 


torch.ones((2, 3, 4)) 


tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]])


We often wish to sample each element randomly (and independently) from a given prob- 
ability distribution. For example, the parameters of neural networks are often initialized 
randomly. The following snippet creates a tensor with elements drawn from a standard 
Gaussian (normal) distribution with mean 0 and standard deviation 1. 


torch.randn(3, 4) 


tensor([[ 0.1351, -0.9099, -0.2028,  2.1937],
[-0.3200, -0.7545, 0.8086, -1.8730], 
[ 0.3929, 0.4931, 0.9114, -0.7072]]) 


Finally, we can construct tensors with exact values for each element by supplying
(possibly nested) Python list(s) containing numerical literals. Here, we construct a
matrix with a list of lists, where the outermost list corresponds to axis 0, and the inner list 
corresponds to axis 1. 


torch.tensor([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])


tensor([[2, 1, 4, 3],
        [1, 2, 3, 4],
        [4, 3, 2, 1]])


2.1.2 Indexing and Slicing 


As with Python lists, we can access tensor elements by indexing (starting with 0). To access
an element based on its position relative to the end of the list, we can use negative indexing.
We can also access whole ranges of indices via slicing (e.g., X[start:stop]), where
the returned value includes the first index (start) but not the last (stop). Finally, when
only one index (or slice) is specified for a kth-order tensor, it is applied along axis 0. Thus,
in the following code, [-1] selects the last row and [1:3] selects the second and third
rows.


X[-1], X[1:3]


(tensor([ 8., 9., 10., 11.]), 
tensor([[ 4., 5., 6., 7.], 
[ 8., 9., 10., 11.]])) 


Beyond reading them, we can also write elements of a matrix by specifying indices. 


X[1, 2] = 17
X


tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5., 17.,  7.],
        [ 8.,  9., 10., 11.]])


If we want to assign multiple elements the same value, we apply the indexing on the left- 
hand side of the assignment operation. For instance, [:2, :] accesses the first and second 
rows, where : takes all the elements along axis 1 (column). While we discussed indexing 
for matrices, this also works for vectors and for tensors of more than two dimensions. 


X[:2, :] = 12
X


tensor([[12., 12., 12., 12.],
        [12., 12., 12., 12.],
        [ 8.,  9., 10., 11.]])


2.1.3 Operations


Now that we know how to construct tensors and how to read from and write to their ele- 
ments, we can begin to manipulate them with various mathematical operations. Among the 
most useful of these are the elementwise operations. These apply a standard scalar opera- 
tion to each element of a tensor. For functions that take two tensors as inputs, elementwise 
operations apply some standard binary operator on each pair of corresponding elements. 
We can create an elementwise function from any function that maps from a scalar to a 
scalar. 


In mathematical notation, we denote such unary scalar operators (taking one input) by the
signature $f: \mathbb{R} \rightarrow \mathbb{R}$. This just means that the function maps from any real number onto
some other real number. Most standard operators, including unary ones like $e^x$, can be
applied elementwise.


torch.exp(x)


tensor([162754.7969, 162754.7969, 162754.7969, 162754.7969, 162754.7969,
        162754.7969, 162754.7969, 162754.7969,   2980.9580,   8103.0840,
         22026.4648,  59874.1406])
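

The first eight entries of this output all equal $e^{12}$ rather than $e^0, \ldots, e^7$. A likely explanation, sketched below, is that `reshape` on a contiguous tensor returns a view sharing memory with the original, so the earlier in-place assignment to the first two rows of X also changed x. The check via `data_ptr` is our own addition for illustration.

```python
import torch

x = torch.arange(12, dtype=torch.float32)
X = x.reshape(3, 4)     # for a contiguous tensor, reshape returns a view
X[:2, :] = 12           # in-place write through the view

# Both tensors start at the same memory address
shares_memory = x.data_ptr() == X.data_ptr()
first_eight = x[:8].tolist()
last_four = x[8:].tolist()
```

If an independent copy is wanted instead, calling `clone()` on the tensor before modifying it avoids this coupling.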


Likewise, we denote binary scalar operators, which map pairs of real numbers to a (single)
real number via the signature $f: \mathbb{R}, \mathbb{R} \rightarrow \mathbb{R}$. Given any two vectors $\mathbf{u}$ and $\mathbf{v}$ of the
same shape, and a binary operator $f$, we can produce a vector $\mathbf{c} = F(\mathbf{u}, \mathbf{v})$ by setting
$c_i \gets f(u_i, v_i)$ for all $i$, where $c_i$, $u_i$, and $v_i$ are the $i^\textrm{th}$ elements of vectors $\mathbf{c}$, $\mathbf{u}$, and $\mathbf{v}$.
Here, we produced the vector-valued $F: \mathbb{R}^d, \mathbb{R}^d \rightarrow \mathbb{R}^d$ by lifting the scalar function to an
elementwise vector operation. The common standard arithmetic operators for addition (+),
subtraction (-), multiplication (*), division (/), and exponentiation (**) have all been lifted
to elementwise operations for identically-shaped tensors of arbitrary shape.


x = torch.tensor([1.0, 2, 4, 8])
y = torch.tensor([2, 2, 2, 2])
x + y, x - y, x * y, x / y, x ** y

(tensor([ 3., 4., 6., 10.]), 
tensor([-1., 0., 2., 6.]), 
tensor([ 2., 4., 8., 16.]), 
tensor([0.5000, 1.0000, 2.0000, 4.0000]), 
tensor([ 1., 4., 16., 64.])) 
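

The lifting of a scalar function to an elementwise vector operation can be spelled out explicitly. In the sketch below, the scalar function `f` and the helper `lift` are hypothetical names introduced for illustration; the explicit loop is compared with the built-in elementwise operators.

```python
import torch

def f(a, b):
    """A scalar binary operator: takes two numbers and returns one."""
    return a * b + 1.0

def lift(f, u, v):
    """Apply f to each pair of corresponding elements: c_i = f(u_i, v_i)."""
    assert u.shape == v.shape
    return torch.tensor([f(a.item(), b.item()) for a, b in zip(u, v)])

u = torch.tensor([1.0, 2.0, 4.0, 8.0])
v = torch.tensor([2.0, 2.0, 2.0, 2.0])

c_loop = lift(f, u, v)   # explicit elementwise loop over the scalar function
c_vec = u * v + 1.0      # same computation with built-in elementwise operators
```

The built-in operators are preferable in practice, since the loop runs in the Python interpreter one element at a time.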


In addition to elementwise computations, we can also perform linear algebraic operations, 
such as dot products and matrix multiplications. We will elaborate on these in Section 
2.3. 


We can also concatenate multiple tensors, stacking them end-to-end to form a larger one. 
We just need to provide a list of tensors and tell the system along which axis to concatenate. 
The example below shows what happens when we concatenate two matrices along rows 
(axis 0) and along columns (axis 1). We can see that the first output’s axis-0 length (6) is
the sum of the two input tensors’ axis-0 lengths (3 + 3); while the second output’s axis-1
length (8) is the sum of the two input tensors’ axis-1 lengths (4 + 4).


X = torch.arange(12, dtype=torch.float32).reshape((3,4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
torch.cat((X, Y), dim=0), torch.cat((X, Y), dim=1)


(tensor([[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.],
         [ 2.,  1.,  4.,  3.],
         [ 1.,  2.,  3.,  4.],
         [ 4.,  3.,  2.,  1.]]),
 tensor([[ 0.,  1.,  2.,  3.,  2.,  1.,  4.,  3.],
         [ 4.,  5.,  6.,  7.,  1.,  2.,  3.,  4.],
         [ 8.,  9., 10., 11.,  4.,  3.,  2.,  1.]]))


Sometimes, we want to construct a binary tensor via logical statements. Take X == Y as an
example. For each position i, j, if X[i, j] and Y[i, j] are equal, then the corresponding
entry in the result takes value 1 (True); otherwise, it takes value 0 (False).


X == Y


tensor([[False, True, False, True], 
[False, False, False, False], 
[False, False, False, False]]) 


Summing all the elements in the tensor yields a tensor with only one element. 


X.sum() 


tensor(66.)


2.1.4 Broadcasting 


By now, you know how to perform elementwise binary operations on two tensors of the 
same shape. Under certain conditions, even when shapes differ, we can still perform ele- 
mentwise binary operations by invoking the broadcasting mechanism. Broadcasting works 
according to the following two-step procedure: (i) expand one or both arrays by copying 
elements along axes with length 1 so that after this transformation, the two tensors have the 
same shape; (ii) perform an elementwise operation on the resulting arrays. 


a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))
a, b


(tensor([[0],
         [1],
         [2]]),
 tensor([[0, 1]]))


Since a and b are 3 × 1 and 1 × 2 matrices, respectively, their shapes do not match up.
Broadcasting produces a larger 3 × 2 matrix by replicating matrix a along the columns and
matrix b along the rows before adding them elementwise. 


a + b


tensor([[0, 1],
        [1, 2],
        [2, 3]])
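

The two-step procedure can be carried out by hand to confirm that it matches what broadcasting does implicitly. This sketch uses `expand`, which replicates elements along length-1 axes logically (without actually copying memory); the variable names are ours.

```python
import torch

a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))

# Step (i): replicate along the length-1 axes so both shapes become (3, 2)
a_expanded = a.expand(3, 2)   # each row of a repeated across 2 columns
b_expanded = b.expand(3, 2)   # the single row of b repeated across 3 rows

# Step (ii): ordinary elementwise addition on identically shaped tensors
manual = a_expanded + b_expanded
automatic = a + b             # broadcasting performs both steps implicitly
```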


2.1.5 Saving Memory 


Running operations can cause new memory to be allocated to host results. For example, if 
we write Y = X + Y, we dereference the tensor that Y used to point to and instead point Y at 
the newly allocated memory. We can demonstrate this issue with Python’s id() function, 
which gives us the exact address of the referenced object in memory. Note that after we 
run Y = Y + X, id(Y) points to a different location. That is because Python first evaluates 
Y + X, allocating new memory for the result and then points Y to this new location in 
memory. 


before = id(Y) 
Y = Y + X
id(Y) == before 


False 


This might be undesirable for two reasons. First, we do not want to run around allocat- 
ing memory unnecessarily all the time. In machine learning, we often have hundreds of 
megabytes of parameters and update all of them multiple times per second. Whenever 
possible, we want to perform these updates in place. Second, we might point at the same 
parameters from multiple variables. If we do not update in place, we must be careful to 
update all of these references, lest we spring a memory leak or inadvertently refer to stale 
parameters. 


Fortunately, performing in-place operations is easy. We can assign the result of an oper- 
ation to a previously allocated array Y by using slice notation: Y[:] = <expression>. 
To illustrate this concept, we overwrite the values of tensor Z, after initializing it, using 
zeros_like, to have the same shape as Y. 


Z = torch.zeros_like(Y)
print('id(Z):', id(Z))
Z[:] = X + Y
print('id(Z):', id(Z))


id(Z): 140381179266448 
id(Z): 140381179266448 


If the value of X is not reused in subsequent computations, we can also use X[:] = X + Y 
or X += Y to reduce the memory overhead of the operation. 


before = id(X) 
X += Y 
id(X) == before 


True 
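

The second concern raised above, stale references, can be made concrete with two variables that refer to the same tensor. The names `params` and `alias` are hypothetical and used only in this sketch.

```python
import torch

# Out-of-place update: params is rebound to a new tensor, alias is left behind
params = torch.zeros(3)
alias = params
params = params + 1
stale = alias.tolist()      # alias still sees the old values

# In-place update: both names keep referring to the same, updated tensor
params = torch.zeros(3)
alias = params
params += 1
fresh = alias.tolist()      # alias sees the new values
```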


2.1.6 Conversion to Other Python Objects 


Converting to a NumPy tensor (ndarray), or vice versa, is easy. The torch tensor and 
NumPy array will share their underlying memory, and changing one through an in-place 
operation will also change the other. 


A = X.numpy() 
B = torch.from_numpy(A)
type(A), type(B) 


(numpy.ndarray, torch.Tensor) 
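

The shared memory can be observed directly. In this sketch (variable names are ours), in-place changes on either side are visible on the other, whereas `torch.tensor` applied to a NumPy array makes an independent copy.

```python
import torch

X = torch.zeros(3)
A = X.numpy()            # NumPy array backed by the same memory (CPU tensors)

X += 1                   # in-place change to the tensor is visible in A
A[0] = 5.0               # in-place change to the array is visible in X

C = torch.tensor(A)      # torch.tensor copies the data instead of sharing it
A[1] = -1.0              # affects X (shared) but not C (copy)
```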


To convert a size-1 tensor to a Python scalar, we can invoke the item function or Python’s 
built-in functions. 


a = torch.tensor([3.5])
a, a.item(), float(a), int(a) 


(tensor([3.5000]), 3.5, 3.5, 3)


2.1.7 Summary 


The tensor class is the main interface for storing and manipulating data in deep learning li- 
braries. Tensors provide a variety of functionalities including construction routines; index- 
ing and slicing; basic mathematics operations; broadcasting; memory-efficient assignment; 
and conversion to and from other Python objects. 


2.1.8 Exercises


1. Run the code in this section. Change the conditional statement X == Y to X < Y or X >
Y, and then see what kind of tensor you can get.


2. Replace the two tensors that operate by element in the broadcasting mechanism with 
other shapes, e.g., 3-dimensional tensors. Is the result the same as expected? 


Discussions.


2.2 Data Preprocessing


So far, we have been working with synthetic data that arrived in ready-made tensors.
However, to apply deep learning in the wild we must extract messy data stored in arbitrary
formats, and preprocess it to suit our needs. Fortunately, the pandas library can do much
of the heavy lifting. This section, while no substitute for a proper pandas tutorial, will
give you a crash course on some of the most common routines.


2.2.1 Reading the Dataset 


Comma-separated values (CSV) files are ubiquitous for the storing of tabular (spreadsheet- 
like) data. In them, each line corresponds to one record and consists of several (comma- 
separated) fields, e.g., “Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,field
of gravitational physics”. To demonstrate how to load CSV files with pandas, we create a
CSV file below ../data/house_tiny.csv. This file represents a dataset of homes, where
each row corresponds to a distinct home and the columns correspond to the number of 
rooms (NumRooms), the roof type (RoofType), and the price (Price). 


import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')


Now let’s import pandas and load the dataset with read_csv. 


import pandas as pd 


data = pd.read_csv(data_file) 
print (data) 


   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000


2.2.2 Data Preparation 


In supervised learning, we train models to predict a designated target value, given some 
set of input values. Our first step in processing the dataset is to separate out columns cor- 
responding to input versus target values. We can select columns either by name or via 
integer-location based indexing (iloc). 
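

Both ways of selecting columns give the same result. The sketch below rebuilds the small dataset from an in-memory string (an assumption made so that the example runs on its own, without the CSV file) and compares selection by name with selection via iloc.

```python
import io
import pandas as pd

csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""
data = pd.read_csv(io.StringIO(csv_text))

by_position = data.iloc[:, 0:2]             # first two columns by integer location
by_name = data[['NumRooms', 'RoofType']]    # the same columns selected by name
targets = data['Price']                     # target column selected by name

same_inputs = by_position.equals(by_name)   # equals treats matching NaNs as equal
```

Selecting by name is more robust when the column order of a file may change, while iloc is convenient when only positions are known.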


You might have noticed that pandas replaced all CSV entries with value NA with a spe- 
cial NaN (not a number) value. This can also happen whenever an entry is empty, e.g., 
“3,,,270000”. These are called missing values and they are the “bed bugs” of data science,
a persistent menace that you will confront throughout your career. Depending upon the 
context, missing values might be handled either via imputation or deletion. Imputation re- 
places missing values with estimates of their values while deletion simply discards either 
those rows or those columns that contain missing values. 
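

Deletion can be sketched with the `dropna` method of a pandas DataFrame. As before, the dataset is rebuilt from an in-memory string so that the example is self-contained; that setup is an assumption of this sketch.

```python
import io
import pandas as pd

csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""
data = pd.read_csv(io.StringIO(csv_text))

rows_kept = data.dropna(axis=0)   # discard every row that has a missing value
cols_kept = data.dropna(axis=1)   # discard every column that has a missing value
```

On this tiny dataset, deletion is costly: only one complete row survives, and dropping columns leaves only the target. This is one reason why imputation is often preferred when data is scarce.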


Here are some common imputation heuristics. For categorical input fields, we can treat NaN
as a category. Since the RoofType column takes values Slate and NaN, pandas can convert
this column into two columns RoofType_Slate and RoofType_nan. A row whose roof type
is Slate will set values of RoofType_Slate and RoofType_nan to 1 and 0, respectively.
The converse holds for a row with a missing RoofType value.


inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = pd.get_dummies(inputs, dummy_na=True) 
print(inputs) 


   NumRooms  RoofType_Slate  RoofType_nan
0       NaN           False          True
1       2.0           False          True
2       4.0            True         False
3       NaN           False          True


For missing numerical values, one common heuristic is to replace the NaN entries with the 
mean value of the corresponding column. 


inputs = inputs.fillna(inputs.mean()) 
print(inputs) 


   NumRooms  RoofType_Slate  RoofType_nan
0       3.0           False          True
1       2.0           False          True
2       4.0            True         False
3       3.0           False          True
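
The mean-imputation heuristic can also be written out by hand. The following is a small sketch in plain Python, assuming the column is a list of floats with NaN marking the missing entries; `impute_mean` is a hypothetical helper, not a pandas function.

```python
import math


def impute_mean(column):
    """Replace NaN entries with the mean of the observed (non-NaN) entries."""
    observed = [v for v in column if not math.isnan(v)]
    if not observed:  # nothing to average over, so leave the column unchanged
        return list(column)
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in column]


num_rooms = [float("nan"), 2.0, 4.0, float("nan")]
print(impute_mean(num_rooms))  # [3.0, 2.0, 4.0, 3.0]
```

Note that the mean is computed over the observed entries only, so the two missing rooms are filled with (2 + 4) / 2 = 3.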


2.2.3 Conversion to the Tensor Format


Now that all the entries in inputs and targets are numerical, we can load them into a 
tensor (recall Section 2.1). 


import torch 


X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))
X, y


(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))


2.2.4 Discussion 


You now know how to partition data columns, impute missing variables, and load pan- 
das data into tensors. In Section 5.7, you will pick up some more data processing skills. 
While this crash course kept things simple, data processing can get hairy. For example, 
rather than arriving in a single CSV file, our dataset might be spread across multiple files 
extracted from a relational database. For instance, in an e-commerce application, customer 
addresses might live in one table and purchase data in another. Moreover, practitioners face 
myriad data types beyond categorical and numeric, for example, text strings, images, audio 
data, and point clouds. Oftentimes, advanced tools and efficient algorithms are required 
in order to prevent data processing from becoming the biggest bottleneck in the machine 
learning pipeline. These problems will arise when we get to computer vision and natural 
language processing. Finally, we must pay attention to data quality. Real-world datasets are
often plagued by outliers, faulty measurements from sensors, and recording errors, which
must be addressed before feeding the data into any model. Data visualization tools such as
seaborn, Bokeh, or matplotlib can help you to manually inspect the data and develop
intuitions about the type of problems you may need to address.




2.2.5 Exercises 


1. Try loading datasets and inspect their properties. What fraction of them has missing
values? What fraction of the variables is numerical, categorical, or text?


2. Try indexing and selecting data columns by name rather than by column number. The
pandas documentation on indexing has further details on how to do this.


3. How large a dataset do you think you could load this way? What might be the limita-
tions? Hint: consider the time to read the data, representation, processing, and memory
footprint. Try this out on your laptop. What happens if you try it out on a server?


4. How would you deal with data that has a very large number of categories? What if the 
category labels are all unique? Should you include the latter? 


5. What alternatives to pandas can you think of? How about loading NumPy tensors from
a file? Check out Pillow, the Python Imaging Library.




2.3 Linear Algebra 


By now, we can load datasets into tensors and manipulate these tensors with basic
mathematical operations. To start building sophisticated models, we will also need a few tools
from linear algebra. This section offers a gentle introduction to the most essential concepts, 
starting from scalar arithmetic and ramping up to matrix multiplication. 


import torch 


2.3.1 Scalars 


Most everyday mathematics consists of manipulating numbers one at a time. Formally, we 
call these values scalars. For example, the temperature in Palo Alto is a balmy 72 degrees 
Fahrenheit. If you wanted to convert the temperature to Celsius you would evaluate the 
expression $c = \frac{5}{9}(f - 32)$, setting $f$ to $72$. In this equation, the values $5$, $9$, and $32$ are
constant scalars. The variables c and f in general represent unknown scalars. 
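
As a quick worked example of the conversion above, the expression can be evaluated directly. This is only an illustrative sketch; the function name `fahrenheit_to_celsius` is our own choice.

```python
def fahrenheit_to_celsius(f):
    """Evaluate c = 5/9 * (f - 32) for a scalar temperature f."""
    return 5 / 9 * (f - 32)


c = fahrenheit_to_celsius(72)
print(round(c, 2))  # 22.22
```

Here 5, 9, and 32 are the constant scalars, while `f` and the returned value play the roles of the variables $f$ and $c$.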


We denote scalars by ordinary lower-cased letters (e.g., $x$, $y$, and $z$) and the space of all
(continuous) real-valued scalars by $\mathbb{R}$. For expedience, we will skip past rigorous
definitions of spaces: just remember that the expression $x \in \mathbb{R}$ is a formal way to
say that $x$ is a real-valued scalar. The symbol $\in$ (pronounced “in”) denotes membership in a
set. For example, $x, y \in \{0, 1\}$ indicates that $x$ and $y$ are variables that can only take
values $0$ or $1$.


Scalars are implemented as tensors that contain only one element. Below, we assign two 
scalars and perform the familiar addition, multiplication, division, and exponentiation op- 
erations. 


x = torch.tensor(3.0)
y = torch.tensor(2.0)

x + y, x * y, x / y, x**y


(tensor(5.), tensor(6.), tensor(1.5000), tensor(9.))


2.3.2 Vectors 


For current purposes, you can think of a vector as a fixed-length array of scalars. As with 
their code counterparts, we call these scalars the elements of the vector (synonyms include 
entries and components). When vectors represent examples from real-world datasets, their 
values hold some real-world significance. For example, if we were training a model to 
predict the risk of a loan defaulting, we might associate each applicant with a vector whose 
components correspond to quantities like their income, length of employment, or number of 
previous defaults. If we were studying the risk of heart attack, each vector might represent 
a patient and its components might correspond to their most recent vital signs, cholesterol 
levels, minutes of exercise per day, etc. We denote vectors by bold lowercase letters (e.g.,
$\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$).


Vectors are implemented as $1^{\textrm{st}}$-order tensors. In general, such tensors can have arbitrary
lengths, subject to memory limitations. Caution: in Python, as in most programming lan- 
guages, vector indices start at 0, also known as zero-based indexing, whereas in linear 
algebra subscripts begin at 1 (one-based indexing). 


x = torch.arange(3)
x


tensor([0, 1, 2])


We can refer to an element of a vector by using a subscript. For example, $x_2$ denotes the
second element of $\mathbf{x}$. Since $x_2$ is a scalar, we do not bold it. By default, we
visualize vectors by stacking their elements vertically.

$$\mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}. \tag{2.3.1}$$

Here $x_1, \ldots, x_n$ are elements of the vector. Later on, we will distinguish between such
column vectors and row vectors whose elements are stacked horizontally. Recall that we
access a tensor’s elements via indexing.


x[2] 


tensor(2)


To indicate that a vector contains $n$ elements, we write $\mathbf{x} \in \mathbb{R}^n$. Formally,
we call $n$ the dimensionality of the vector. In code, this corresponds to the tensor’s length,
accessible via Python’s built-in len function.


len(x)


3


We can also access the length via the shape attribute. The shape is a tuple that indicates 
a tensor’s length along each axis. Tensors with just one axis have shapes with just one 
element. 


x.shape


torch.Size([3])


Oftentimes, the word “dimension” gets overloaded to mean both the number of axes and the 
length along a particular axis. To avoid this confusion, we use order to refer to the number 
of axes and dimensionality exclusively to refer to the number of components. 


2.3.3 Matrices 


Just as scalars are $0^{\textrm{th}}$-order tensors and vectors are $1^{\textrm{st}}$-order tensors,
matrices are $2^{\textrm{nd}}$-order tensors. We denote matrices by bold capital letters (e.g.,
$\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$), and represent them in code by tensors with two axes.
The expression $\mathbf{A} \in \mathbb{R}^{m \times n}$ indicates that a matrix $\mathbf{A}$
contains $m \times n$ real-valued scalars, arranged as $m$ rows and $n$ columns. When $m = n$, we
say that a matrix is square. Visually, we can illustrate any matrix as a table. To refer to an
individual element, we subscript both the row and column indices, e.g., $a_{ij}$ is the value that
belongs to $\mathbf{A}$’s $i^{\textrm{th}}$ row and $j^{\textrm{th}}$ column:

$$\mathbf{A} = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}. \tag{2.3.2}$$

In code, we represent a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ by a
$2^{\textrm{nd}}$-order tensor with shape ($m$, $n$). We can convert any appropriately sized
$m \times n$ tensor into an $m \times n$ matrix by passing the desired shape to reshape:


A = torch.arange(6).reshape(3, 2) 
A 


tensor([[0, 1],
        [2, 3],
        [4, 5]])


Sometimes we want to flip the axes. When we exchange a matrix’s rows and columns, the
result is called its transpose. Formally, we signify a matrix $\mathbf{A}$’s transpose by
$\mathbf{A}^\top$ and if $\mathbf{B} = \mathbf{A}^\top$, then $b_{ij} = a_{ji}$ for all $i$ and $j$.
Thus, the transpose of an $m \times n$ matrix is an $n \times m$ matrix:

$$\mathbf{A}^\top = \begin{bmatrix}
a_{11} & a_{21} & \cdots & a_{m1} \\
a_{12} & a_{22} & \cdots & a_{m2} \\
\vdots & \vdots & \ddots & \vdots \\
a_{1n} & a_{2n} & \cdots & a_{mn}
\end{bmatrix}. \tag{2.3.3}$$


In code, we can access any matrix’s transpose as follows: 


A.T 


tensor([[0, 2, 4],
        [1, 3, 5]])


Symmetric matrices are the subset of square matrices that are equal to their own transposes:
$\mathbf{A} = \mathbf{A}^\top$. The following matrix is symmetric:


A = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
A == A.T


tensor([[True, True, True],
        [True, True, True],
        [True, True, True]])


Matrices are useful for representing datasets. Typically, rows correspond to individual 
records and columns correspond to distinct attributes. 
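
The definition $b_{ij} = a_{ji}$ is short enough to implement by hand. The sketch below uses nested Python lists rather than tensors, purely to spell out what transposition and the symmetry check do; the helper names are hypothetical.

```python
def transpose(matrix):
    """Swap rows and columns of a rectangular nested list."""
    return [list(column) for column in zip(*matrix)]


def is_symmetric(matrix):
    """A matrix is symmetric when it equals its own transpose."""
    return matrix == transpose(matrix)


A = [[0, 1], [2, 3], [4, 5]]           # 3 x 2
S = [[1, 2, 3], [2, 0, 4], [3, 4, 5]]  # 3 x 3, mirrored across the diagonal

print(transpose(A))     # [[0, 2, 4], [1, 3, 5]]
print(is_symmetric(S))  # True
print(is_symmetric(A))  # False: a non-square matrix cannot equal its transpose
```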


2.3.4 Tensors 


While you can go far in your machine learning journey with only scalars, vectors, and 
matrices, eventually you may need to work with higher-order tensors. Tensors give us 
a generic way of describing extensions to $n^{\textrm{th}}$-order arrays. We call software objects of
the tensor class “tensors” precisely because they too can have arbitrary numbers of axes.
While it may be confusing to use the word tensor for both the mathematical object and its
realization in code, our meaning should usually be clear from context. We denote general
tensors by capital letters with a special font face (e.g., $\mathsf{X}$, $\mathsf{Y}$, and
$\mathsf{Z}$) and their indexing mechanism (e.g., $x_{ijk}$ and $[\mathsf{X}]_{1, 2i-1, 3}$)
follows naturally from that of matrices.


Tensors will become more important when we start working with images. Each image
arrives as a $3^{\textrm{rd}}$-order tensor with axes corresponding to the height, width, and
channel. At each spatial location, the intensities of each color (red, green, and blue) are
stacked along the channel. Furthermore, a collection of images is represented in code by a
$4^{\textrm{th}}$-order tensor, where distinct images are indexed along the first axis.
Higher-order tensors are constructed, as were vectors and matrices, by growing the number of
shape components.


torch.arange(24).reshape(2, 3, 4)


tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])
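
To see how the shape components map onto nesting, the following plain-Python sketch rebuilds the same values as a nested list and reads off its shape. It assumes rectangular nesting (all sublists at a level have equal length); `nested_shape` is a hypothetical helper, not a library function.

```python
def nested_shape(t):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        if not t:  # stop at an empty axis
            break
        t = t[0]
    return tuple(dims)


# Same values as the tensor above: entry (i, j, k) holds 12*i + 4*j + k.
X = [[[12 * i + 4 * j + k for k in range(4)] for j in range(3)] for i in range(2)]

print(nested_shape(X))  # (2, 3, 4)
print(X[1][2][3])       # 23, the last element
```

Each extra level of nesting adds one axis, so the order of the tensor equals the length of the shape tuple.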


2.3.5 Basic Properties of Tensor Arithmetic 


Scalars, vectors, matrices, and higher-order tensors all have some handy properties. For ex- 
ample, elementwise operations produce outputs that have the same shape as their operands. 


A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = A.clone()  # Assign a copy of A to B by allocating new memory
A, A + B


(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 tensor([[ 0.,  2.,  4.],
         [ 6.,  8., 10.]]))


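The elementwise product of two matrices is called their Hadamard product (denoted $\odot$).
We can spell out the entries of the Hadamard product of two matrices
$\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$:

$$\mathbf{A} \odot \mathbf{B} = \begin{bmatrix}
a_{11} b_{11} & a_{12} b_{12} & \dots & a_{1n} b_{1n} \\
a_{21} b_{21} & a_{22} b_{22} & \dots & a_{2n} b_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} b_{m1} & a_{m2} b_{m2} & \dots & a_{mn} b_{mn}
\end{bmatrix}. \tag{2.3.4}$$


A * B


tensor([[ 0.,  1.,  4.],
        [ 9., 16., 25.]])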


Adding or multiplying a scalar and a tensor produces a result with the same shape as 
the original tensor. Here, each element of the tensor is added to (or multiplied by) the 
scalar. 


a = 2
X = torch.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape


(tensor([[[ 2,  3,  4,  5],
          [ 6,  7,  8,  9],
          [10, 11, 12, 13]],

         [[14, 15, 16, 17],
          [18, 19, 20, 21],
          [22, 23, 24, 25]]]),
 torch.Size([2, 3, 4]))


2.3.6 Reduction 


Often, we wish to calculate the sum of a tensor’s elements. To express the sum of the
elements in a vector $\mathbf{x}$ of length $n$, we write $\sum_{i=1}^n x_i$. There is a simple
function for it:


x = torch.arange(3, dtype=torch.float32)
x, x.sum()


(tensor([0., 1., 2.]), tensor(3.))


To express sums over the elements of tensors of arbitrary shape, we simply sum over all
its axes. For example, the sum of the elements of an $m \times n$ matrix $\mathbf{A}$ could be
written $\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}$.


A.shape, A.sum() 


(torch.Size([2, 3]), tensor(15.)) 


By default, invoking the sum function reduces a tensor along all of its axes, eventually 
producing a scalar. Our libraries also allow us to specify the axes along which the tensor 
should be reduced. To sum over all elements along the rows (axis 0), we specify axis=0 in
sum. Since the input matrix reduces along axis 0 to generate the output vector, this axis is 
missing from the shape of the output. 


A.shape, A.sum(axis=0).shape


(torch.Size([2, 3]), torch.Size([3])) 


Specifying axis=1 will reduce the column dimension (axis 1) by summing up elements of 
all the columns. 


A.shape, A.sum(axis=1).shape 


(torch.Size([2, 3]), torch.Size([2]))


Reducing a matrix along both rows and columns via summation is equivalent to summing
up all the elements of the matrix.


A.sum(axis=[0, 1]) == A.sum()  # Same as A.sum()


tensor(True)


A related quantity is the mean, also called the average. We calculate the mean by dividing 
the sum by the total number of elements. Because computing the mean is so common, it 
gets a dedicated library function that works analogously to sum. 


A.mean(), A.sum() / A.numel() 


(tensor(2.5000), tensor(2.5000)) 


Likewise, the function for calculating the mean can also reduce a tensor along specific 
axes. 


A.mean(axis=0), A.sum(axis=0) / A.shape[0]


(tensor([1.5000, 2.5000, 3.5000]), tensor([1.5000, 2.5000, 3.5000])) 
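
It can help to see which entries are combined when a particular axis is reduced. The following sketch repeats the reductions above on a nested list in plain Python, using the same 2 x 3 values as A; it is an illustration of the bookkeeping, not how the library implements it.

```python
A = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]  # shape (2, 3)

sum_axis0 = [sum(column) for column in zip(*A)]  # axis 0 collapses: 3 entries remain
sum_axis1 = [sum(row) for row in A]              # axis 1 collapses: 2 entries remain
total = sum(sum_axis1)                           # reducing both axes gives a scalar
mean_axis0 = [s / len(A) for s in sum_axis0]     # divide by the length of axis 0

print(sum_axis0)   # [3.0, 5.0, 7.0]
print(sum_axis1)   # [3.0, 12.0]
print(total)       # 15.0
print(mean_axis0)  # [1.5, 2.5, 3.5]
```

The axis named in the reduction is the one that disappears from the shape of the result.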


2.3.7 Non-Reduction Sum 


Sometimes it can be useful to keep the number of axes unchanged when invoking the func- 
tion for calculating the sum or mean. This matters when we want to use the broadcast 
mechanism. 


sum_A = A.sum(axis=1, keepdims=True) 
sum_A, sum_A.shape


(tensor([[ 3.],
         [12.]]),
 torch.Size([2, 1]))


For instance, since sum_A keeps its two axes after summing each row, we can divide A by 
sum_A with broadcasting to create a matrix where each row sums up to 1. 


A / sum_A 


tensor([[0.0000, 0.3333, 0.6667],
        [0.2500, 0.3333, 0.4167]])


If we want to calculate the cumulative sum of elements of A along some axis, say axis=0
(row by row), we can call the cumsum function. By design, this function does not reduce
the input tensor along any axis.


A.cumsum(axis=0) 


tensor([[0., 1., 2.],
        [3., 5., 7.]])
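
Both non-reducing operations in this subsection can be mimicked with nested lists, which makes the role of the retained axis explicit. This is a plain-Python sketch under the assumption that A holds the same 2 x 3 values as above; the variable names are ours.

```python
from itertools import accumulate

A = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]

# Row sums kept as a (2, 1) nested list, analogous to keepdims=True.
row_sums = [[sum(row)] for row in A]

# "Broadcasting" by hand: every entry of a row is divided by that row's sum.
normalized = [[a / s[0] for a in row] for row, s in zip(A, row_sums)]

# Cumulative sum along axis 0: accumulate down each column, then transpose back.
columns = [list(accumulate(column)) for column in zip(*A)]
cumsum_axis0 = [list(row) for row in zip(*columns)]

print(row_sums)      # [[3.0], [12.0]]
print(cumsum_axis0)  # [[0.0, 1.0, 2.0], [3.0, 5.0, 7.0]]
```

After normalization each row sums to 1, and the cumulative sum keeps the original (2, 3) shape.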


2.3.8 Dot Products 


So far, we have only performed elementwise operations, sums, and averages. If this were
all we could do, linear algebra would not deserve its own section. Fortunately, this is where
things get more interesting. One of the most fundamental operations is the dot product.
Given two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, their dot product
$\mathbf{x}^\top \mathbf{y}$ (also known as inner product, $\langle \mathbf{x}, \mathbf{y} \rangle$)
is a sum over the products of the elements at the same position:
$\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{d} x_i y_i$.


y = torch.ones(3, dtype=torch.float32)
x, y, torch.dot(x, y)


(tensor([0., 1., 2.]), tensor([1., 1., 1.]), tensor(3.))


Equivalently, we can calculate the dot product of two vectors by performing an elementwise 
multiplication followed by a sum: 


torch.sum(x * y) 


tensor(3.)


Dot products are useful in a wide range of contexts. For example, given some set of values,
denoted by a vector $\mathbf{x} \in \mathbb{R}^n$, and a set of weights, denoted by
$\mathbf{w} \in \mathbb{R}^n$, the weighted sum of the values in $\mathbf{x}$ according to the
weights $\mathbf{w}$ could be expressed as the dot product $\mathbf{x}^\top \mathbf{w}$. When
the weights are nonnegative and sum to $1$, i.e., $\left(\sum_{i=1}^{n} w_i = 1\right)$, the dot
product expresses a weighted average. After normalizing two vectors to have unit length, the
dot products express the cosine of the angle between them. Later in this section, we will
formally introduce this notion of length.
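
The two uses of the dot product mentioned above, weighted averages and cosines, can be sketched in a few lines of plain Python. The helper names and the example weights are our own illustrative choices; the length used for normalization is the Euclidean length.

```python
import math


def dot(x, y):
    """Sum of products of elements at the same position."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Dot product of x and y after scaling both to unit Euclidean length."""
    norm_x, norm_y = math.sqrt(dot(x, x)), math.sqrt(dot(y, y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot(x, y) / (norm_x * norm_y)


values = [0.0, 1.0, 2.0]
weights = [0.2, 0.3, 0.5]  # nonnegative and summing to 1
weighted_average = dot(values, weights)

print(weighted_average)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([1.0, 1.0], [2.0, 2.0]))  # same direction
```

Orthogonal vectors give a cosine of 0, while vectors pointing in the same direction give a cosine of 1 regardless of their lengths.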


2.3.9 Matrix—Vector Products 


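Now that we know how to calculate dot products, we can begin to understand the product
between an $m \times n$ matrix $\mathbf{A}$ and an $n$-dimensional vector $\mathbf{x}$. To start
off, we visualize our matrix in terms of its row vectors

$$\mathbf{A} = \begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_m
\end{bmatrix}, \tag{2.3.5}$$

where each $\mathbf{a}^\top_{i} \in \mathbb{R}^n$ is a row vector representing the
$i^{\textrm{th}}$ row of the matrix $\mathbf{A}$.


The matrix-vector product $\mathbf{A}\mathbf{x}$ is simply a column vector of length $m$, whose
$i^{\textrm{th}}$ element is the dot product $\mathbf{a}^\top_i \mathbf{x}$:

$$\mathbf{A}\mathbf{x} = \begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_m
\end{bmatrix} \mathbf{x} = \begin{bmatrix}
\mathbf{a}^\top_{1} \mathbf{x} \\
\mathbf{a}^\top_{2} \mathbf{x} \\
\vdots \\
\mathbf{a}^\top_{m} \mathbf{x}
\end{bmatrix}. \tag{2.3.6}$$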


We can think of multiplication with a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ as a
transformation that projects vectors from $\mathbb{R}^{n}$ to $\mathbb{R}^{m}$. These
transformations are remarkably useful. For example, we can
represent rotations as multiplications by certain square matrices. Matrix—vector products 
also describe the key calculation involved in computing the outputs of each layer in a neural 
network given the outputs from the previous layer. 


To express a matrix—vector product in code, we use the mv function. Note that the column 
dimension of A (its length along axis 1) must be the same as the dimension of x (its length). 
Python has a convenience operator @ that can execute both matrix—vector and matrix—matrix 
products (depending on its arguments). Thus we can write A@x. 


A.shape, x.shape, torch.mv(A, x), A@x 


(torch.Size([2, 3]), torch.Size([3]), tensor([ 5., 14.]), tensor([ 5., 14.])) 
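
Equation (2.3.6) translates almost literally into code: one dot product per row. The following plain-Python sketch is meant only to spell out that computation; `matvec` is a hypothetical helper and not the implementation behind mv.

```python
def matvec(A, x):
    """Matrix-vector product: the i-th output is the dot product of row i with x."""
    if any(len(row) != len(x) for row in A):
        raise ValueError("each row of A must have the same length as x")
    return [sum(a * b for a, b in zip(row, x)) for row in A]


A = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]  # m = 2 rows, n = 3 columns
x = [0.0, 1.0, 2.0]    # n = 3 elements

print(matvec(A, x))  # [5.0, 14.0]
```

The input has length n = 3 and the output has length m = 2, matching the view of the matrix as a map from $\mathbb{R}^{n}$ to $\mathbb{R}^{m}$.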


2.3.10 Matrix—Matrix Multiplication 


Once you have gotten the hang of dot products and matrix—vector products, then matrix— 
matrix multiplication should be straightforward. 


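Say that we have two matrices $\mathbf{A} \in \mathbb{R}^{n \times k}$ and
$\mathbf{B} \in \mathbb{R}^{k \times m}$:

$$\mathbf{A} = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1k} \\
a_{21} & a_{22} & \cdots & a_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nk}
\end{bmatrix}, \quad
\mathbf{B} = \begin{bmatrix}
b_{11} & b_{12} & \cdots & b_{1m} \\
b_{21} & b_{22} & \cdots & b_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
b_{k1} & b_{k2} & \cdots & b_{km}
\end{bmatrix}. \tag{2.3.7}$$

Let $\mathbf{a}^\top_{i} \in \mathbb{R}^k$ denote the row vector representing the
$i^{\textrm{th}}$ row of the matrix $\mathbf{A}$ and let $\mathbf{b}_{j} \in \mathbb{R}^k$
denote the column vector from the $j^{\textrm{th}}$ column of the matrix $\mathbf{B}$:

$$\mathbf{A} = \begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_n
\end{bmatrix}, \quad
\mathbf{B} = \begin{bmatrix}
\mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m}
\end{bmatrix}. \tag{2.3.8}$$

To form the matrix product $\mathbf{C} \in \mathbb{R}^{n \times m}$, we simply compute each
element $c_{ij}$ as the dot product between the $i^{\textrm{th}}$ row of $\mathbf{A}$ and the
$j^{\textrm{th}}$ column of $\mathbf{B}$, i.e., $\mathbf{a}^\top_i \mathbf{b}_j$:

$$\mathbf{C} = \mathbf{A}\mathbf{B} = \begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_n
\end{bmatrix}
\begin{bmatrix}
\mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m}
\end{bmatrix} = \begin{bmatrix}
\mathbf{a}^\top_{1} \mathbf{b}_1 & \mathbf{a}^\top_{1} \mathbf{b}_2 & \cdots & \mathbf{a}^\top_{1} \mathbf{b}_m \\
\mathbf{a}^\top_{2} \mathbf{b}_1 & \mathbf{a}^\top_{2} \mathbf{b}_2 & \cdots & \mathbf{a}^\top_{2} \mathbf{b}_m \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{a}^\top_{n} \mathbf{b}_1 & \mathbf{a}^\top_{n} \mathbf{b}_2 & \cdots & \mathbf{a}^\top_{n} \mathbf{b}_m
\end{bmatrix}. \tag{2.3.9}$$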


We can think of the matrix—matrix multiplication AB as performing m matrix-vector prod- 
ucts or m x n dot products and stitching the results together to form an n x m matrix. In the 
following snippet, we perform matrix multiplication on A and B. Here, A is a matrix with 
two rows and three columns, and B is a matrix with three rows and four columns. After 
multiplication, we obtain a matrix with two rows and four columns. 


B = torch.ones(3, 4) 
torch.mm(A, B), A@B


(tensor([[ 3.,  3.,  3.,  3.],
         [12., 12., 12., 12.]]),
 tensor([[ 3.,  3.,  3.,  3.],
         [12., 12., 12., 12.]]))


The term matrix—matrix multiplication is often simplified to matrix multiplication, and 
should not be confused with the Hadamard product. 
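
To underline the difference from the Hadamard product, here is a plain-Python sketch of equation (2.3.9), in which every output entry is a dot product between a row of the first matrix and a column of the second. The helper `matmul` is hypothetical and deliberately naive; library routines are far more efficient.

```python
def matmul(A, B):
    """Matrix product of an (n x k) and a (k x m) nested list, giving (n x m)."""
    k = len(A[0])
    if any(len(row) != k for row in A) or len(B) != k:
        raise ValueError("inner dimensions must agree")
    columns = list(zip(*B))  # the m columns of B, each of length k
    # One dot product per (row, column) pair: n * m dot products of length k.
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in A]


A = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]              # 2 x 3
B = [[1.0] * 4 for _ in range(3)]  # 3 x 4, all ones

C = matmul(A, B)
print(C)                  # [[3.0, 3.0, 3.0, 3.0], [12.0, 12.0, 12.0, 12.0]]
print(len(C), len(C[0]))  # 2 4
```

Unlike an elementwise product, the operands here have different shapes, and only the inner dimension (3 in this example) has to match.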


2.3.11 Norms 


Some of the most useful operators in linear algebra are norms. Informally, the norm of a 
vector tells us how big it is. For instance, the $\ell_2$ norm measures the (Euclidean) length of a
vector. Here, we are employing a notion of size that concerns the magnitude of a vector’s 
components (not its dimensionality). 


A norm is a function $\| \cdot \|$ that maps a vector to a scalar and satisfies the following three
properties:


1. Given any vector $\mathbf{x}$, if we scale (all elements of) the vector by a scalar
$\alpha \in \mathbb{R}$, its norm scales accordingly:

$$\|\alpha \mathbf{x}\| = |\alpha| \|\mathbf{x}\|. \tag{2.3.10}$$

2. For any vectors $\mathbf{x}$ and $\mathbf{y}$, norms satisfy the triangle inequality:

$$\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|. \tag{2.3.11}$$

3. The norm of a vector is nonnegative and it only vanishes if the vector is zero:

$$\|\mathbf{x}\| > 0 \textrm{ for all } \mathbf{x} \neq 0. \tag{2.3.12}$$


Many functions are valid norms and different norms encode different notions of size. The 
Euclidean norm that we all learned in elementary school geometry when calculating the 
hypotenuse of a right triangle is the square root of the sum of squares of a vector’s elements. 
Formally, this is called the $\ell_2$ norm and expressed as

$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}. \tag{2.3.13}$$

The method norm calculates the $\ell_2$ norm.


u = torch.tensor([3.0, -4.0]) 
torch.norm(u) 


tensor(5.) 


The $\ell_1$ norm is also common and the associated measure is called the Manhattan distance.
By definition, the $\ell_1$ norm sums the absolute values of a vector’s elements:

$$\|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|. \tag{2.3.14}$$

Compared to the $\ell_2$ norm, it is less sensitive to outliers. To compute the $\ell_1$ norm, we
compose the absolute value with the sum operation.


torch.abs(u).sum()


tensor(7.)


Both the $\ell_2$ and $\ell_1$ norms are special cases of the more general $\ell_p$ norms:

$$\|\mathbf{x}\|_p = \left(\sum_{i=1}^n \left|x_i \right|^p \right)^{1/p}. \tag{2.3.15}$$
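
Equation (2.3.15) can be checked numerically against the two special cases computed earlier. The sketch below is a plain-Python illustration with a hypothetical helper `lp_norm`; it also spot-checks the scaling property and the triangle inequality on one pair of vectors, which is a sanity check and not a proof.

```python
def lp_norm(x, p):
    """The l_p norm (sum_i |x_i|^p)^(1/p) of a list of numbers, for p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1 for this to be a norm")
    return sum(abs(v) ** p for v in x) ** (1 / p)


u = [3.0, -4.0]
v = [1.0, 2.0]

print(lp_norm(u, 1))  # sum of absolute values
print(lp_norm(u, 2))  # Euclidean length

# Scaling by alpha scales the norm by |alpha| (property 1).
alpha = -2.5
scaled = [alpha * a for a in u]
print(abs(lp_norm(scaled, 2) - abs(alpha) * lp_norm(u, 2)) < 1e-12)

# Triangle inequality (property 2) for this particular pair of vectors.
summed = [a + b for a, b in zip(u, v)]
print(lp_norm(summed, 2) <= lp_norm(u, 2) + lp_norm(v, 2))
```

For u = (3, -4) the first two values agree with the outputs 7 and 5 shown earlier in this subsection.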


In the case of matrices, matters are more complicated. After all, matrices can be viewed
both as collections of individual entries and as objects that operate on vectors and transform
them into other vectors. For instance, we can ask by how much longer the matrix-vector
product $\mathbf{X} \mathbf{v}$ could be relative to $\mathbf{v}$. This line of thought leads to
what is called the spectral norm. For now, we introduce the Frobenius norm, which is much
easier to compute and defined as the square root of the sum of the squares of a matrix’s
elements:

$$\|\mathbf{X}\|_\textrm{F} = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}. \tag{2.3.16}$$

The Frobenius norm behaves as if it were an $\ell_2$ norm of a matrix-shaped vector. Invoking
the following function will calculate the Frobenius norm of a matrix.


torch.norm(torch.ones((4, 9))) 


tensor(6.)
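
The value 6 is easy to verify by hand: a 4 x 9 matrix of ones has 36 entries, each contributing 1 to the sum of squares, and the square root of 36 is 6. A plain-Python sketch of equation (2.3.16), using a hypothetical helper `frobenius_norm`, confirms this.

```python
import math


def frobenius_norm(X):
    """Square root of the sum of squared entries of a nested-list matrix."""
    return math.sqrt(sum(v * v for row in X for v in row))


ones = [[1.0] * 9 for _ in range(4)]  # 4 x 9 matrix of ones
print(frobenius_norm(ones))  # 6.0
```

Flattening the matrix into a vector of 36 entries and taking its Euclidean length gives the same number, which is the sense in which the Frobenius norm behaves like a vector norm.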


While we do not want to get too far ahead of ourselves, we already can plant some intu- 
ition about why these concepts are useful. In deep learning, we are often trying to solve 
optimization problems: maximize the probability assigned to observed data; maximize the 
revenue associated with a recommender model; minimize the distance between predictions 
and the ground truth observations; minimize the distance between representations of photos 
of the same person while maximizing the distance between representations of photos of dif- 
ferent people. These distances, which constitute the objectives of deep learning algorithms, 
are often expressed as norms. 


2.3.12 Discussion 


In this section, we have reviewed all the linear algebra that you will need to understand a 
significant chunk of modern deep learning. There is a lot more to linear algebra, though, 
and much of it is useful for machine learning. For example, matrices can be decomposed 
into factors, and these decompositions can reveal low-dimensional structure in real-world 
datasets. There are entire subfields of machine learning that focus on using matrix decom- 
positions and their generalizations to high-order tensors to discover structure in datasets 
and solve prediction problems. But this book focuses on deep learning. And we believe 
you will be more inclined to learn more mathematics once you have gotten your hands dirty 
applying machine learning to real datasets. So while we reserve the right to introduce more 
mathematics later on, we wrap up this section here. 


If you are eager to learn more linear algebra, there are many excellent books and online 
resources. For a more advanced crash course, consider checking out Strang (1993), Kolter 
(2008), and Petersen and Pedersen (2008). 


To recap: 


• Scalars, vectors, matrices, and tensors are the basic mathematical objects used in linear
algebra and have zero, one, two, and an arbitrary number of axes, respectively.


• Tensors can be sliced or reduced along specified axes via indexing, or operations such
as sum and mean, respectively.


• Elementwise products are called Hadamard products. By contrast, dot products,
matrix-vector products, and matrix-matrix products are not elementwise operations and in
general return objects having shapes that are different from the operands.


• Compared to Hadamard products, matrix-matrix products take considerably longer to
compute (cubic rather than quadratic time).


• Norms capture various notions of the magnitude of a vector (or matrix), and are com-
monly applied to the difference of two vectors to measure their distance apart.


• Common vector norms include the $\ell_1$ and $\ell_2$ norms, and common matrix norms include
the spectral and Frobenius norms.


2.3.13 Exercises 


1. Prove that the transpose of the transpose of a matrix is the matrix itself:
$(\mathbf{A}^\top)^\top = \mathbf{A}$.


2. Given two matrices $\mathbf{A}$ and $\mathbf{B}$, show that sum and transposition commute:
$\mathbf{A}^\top + \mathbf{B}^\top = (\mathbf{A} + \mathbf{B})^\top$.


3. Given any square matrix $\mathbf{A}$, is $\mathbf{A} + \mathbf{A}^\top$ always symmetric? Can
you prove the result by using only the results of the previous two exercises?


4. We defined the tensor X of shape (2, 3, 4) in this section. What is the output of len(X)?
Write your answer without implementing any code, then check your answer using code.


5. For a tensor X of arbitrary shape, does len(X) always correspond to the length of a
certain axis of X? What is that axis?


6. Run A / A.sum(axis=1) and see what happens. Can you analyze the results?


7. When traveling between two points in downtown Manhattan, what is the distance that
you need to cover in terms of the coordinates, i.e., in terms of avenues and streets? Can
you travel diagonally?


8. Consider a tensor of shape (2, 3, 4). What are the shapes of the summation outputs
along axes 0, 1, and 2?


9. Feed a tensor with three or more axes to the linalg.norm function and observe its
output. What does this function compute for tensors of arbitrary shape?


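10. Consider three large matrices, say $\mathbf{A} \in \mathbb{R}^{2^{10} \times 2^{16}}$,
$\mathbf{B} \in \mathbb{R}^{2^{16} \times 2^{5}}$ and
$\mathbf{C} \in \mathbb{R}^{2^{5} \times 2^{14}}$, initialized with Gaussian random variables.
You want to compute the product $\mathbf{A} \mathbf{B} \mathbf{C}$. Is there any difference in
memory footprint and speed, depending on whether you compute
$(\mathbf{A} \mathbf{B}) \mathbf{C}$ or $\mathbf{A} (\mathbf{B} \mathbf{C})$? Why?


11. Consider three large matrices, say $\mathbf{A} \in \mathbb{R}^{2^{10} \times 2^{16}}$,
$\mathbf{B} \in \mathbb{R}^{2^{16} \times 2^{5}}$ and
$\mathbf{C} \in \mathbb{R}^{2^{5} \times 2^{16}}$. Is there any difference in speed depending on
whether you compute $\mathbf{A} \mathbf{B}$ or $\mathbf{A} \mathbf{C}^\top$? Why? What
changes if you initialize $\mathbf{C} = \mathbf{B}^\top$ without cloning memory? Why?


12. Consider three matrices, say $\mathbf{A}, \mathbf{B}, \mathbf{C} \in \mathbb{R}^{100 \times 200}$.
Construct a tensor with three axes by stacking $[\mathbf{A}, \mathbf{B}, \mathbf{C}]$. What is
the dimensionality? Slice out the second coordinate of the third axis to recover $\mathbf{B}$.
Check that your answer is correct.


Discussions.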


2.4 Calculus 


For a long time, how to calculate the area of a circle remained a mystery. Then, in Ancient
Greece, the mathematician Archimedes came up with the clever idea to inscribe a series of
polygons with increasing numbers of vertices on the inside of a circle (Fig. 2.4.1). For a
polygon with $n$ vertices, we obtain $n$ triangles. The height of each triangle approaches the
radius $r$ as we partition the circle more finely. At the same time, its base approaches
$2 \pi r/n$, since the ratio between arc and secant approaches 1 for a large number of vertices.
Thus, the area of the polygon approaches $n \cdot r \cdot \frac{1}{2} (2 \pi r/n) = \pi r^2$.


Fig. 2.4.1 Finding the area of a circle as a limit procedure.
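
The limiting argument can be watched numerically. The sketch below assumes a regular inscribed polygon, whose exact area is $\frac{1}{2} n r^2 \sin(2\pi/n)$ (a standard formula obtained by summing $n$ isosceles triangles with apex angle $2\pi/n$); the function name is our own.

```python
import math


def inscribed_polygon_area(n, r=1.0):
    """Area of a regular n-gon inscribed in a circle of radius r."""
    return 0.5 * n * r**2 * math.sin(2 * math.pi / n)


for n in (6, 24, 96, 10_000):
    area = inscribed_polygon_area(n)
    print(f"n={n:>6}, area={area:.6f}, gap to pi={math.pi - area:.2e}")
```

For the unit circle the areas increase toward $\pi$ as the number of vertices grows, and the gap shrinks roughly like $1/n^2$.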


This limiting procedure is at the root of both differential calculus and integral calculus. The 
former can tell us how to increase or decrease a function’s value by manipulating its argu- 
ments. This comes in handy for the optimization problems that we face in deep learning, 
where we repeatedly update our parameters in order to decrease the loss function. Opti- 
mization addresses how to fit our models to training data, and calculus is its key prerequisite. 
However, do not forget that our ultimate goal is to perform well on previously unseen data. 
That problem is called generalization and will be a key focus of other chapters. 


%matplotlib inline
import numpy as np
from matplotlib_inline import backend_inline
from d2l import torch as d2l


2.4.1 Derivatives and Differentiation 


Put simply, a derivative is the rate of change in a function with respect to changes in its
arguments. Derivatives can tell us how rapidly a loss function would increase or decrease
were we to increase or decrease each parameter by an infinitesimally small amount.
Formally, for functions $f: \mathbb{R} \rightarrow \mathbb{R}$ that map from scalars to scalars,
the derivative of $f$ at a point $x$ is defined as

$$f'(x) = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h}. \tag{2.4.1}$$

The term on the right-hand side is called a limit and it tells us what happens to the value of
an expression as a specified variable approaches a particular value. This limit tells us what
the ratio between the change in the function value $f(x + h) - f(x)$ and a perturbation $h$
converges to as we shrink the size of the perturbation to zero.


When f'(x) exists, f is said to be differentiable at x; and when f'(x) exists for all x on a 
set, e.g., the interval [a, b], we say that f is differentiable on this set. Not all functions are 
differentiable, including many that we wish to optimize, such as accuracy and the area under 
the receiver operating characteristic (AUC). However, because computing the derivative
of the loss is a crucial step in nearly all algorithms for training deep neural networks, we 
often optimize a differentiable surrogate instead. 


We can interpret the derivative $f'(x)$ as the instantaneous rate of change of $f(x)$ with
respect to $x$. Let’s develop some intuition with an example. Define
$u = f(x) = 3x^2 - 4x$.


def f(x):
    return 3 * x ** 2 - 4 * x


Setting $x = 1$, we see that $\frac{f(x+h) - f(x)}{h}$ approaches $2$ as $h$ approaches $0$.
While this experiment lacks the rigor of a mathematical proof, we can quickly see that indeed
$f'(1) = 2$.


for h in 10.0**np.arange(-1, -6, -1):
    print(f'h={h:.5f}, numerical limit={(f(1+h)-f(1))/h:.5f}')


h=0.10000, numerical limit=2.30000
h=0.01000, numerical limit=2.03000
h=0.00100, numerical limit=2.00300
h=0.00010, numerical limit=2.00030
h=0.00001, numerical limit=2.00003


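There are several equivalent notational conventions for derivatives. Given $y = f(x)$, the
following expressions are equivalent:

$$f'(x) = y' = \frac{dy}{dx} = \frac{df}{dx} = \frac{d}{dx} f(x) = Df(x) = D_x f(x), \tag{2.4.2}$$

where the symbols $\frac{d}{dx}$ and $D$ are differentiation operators. Below, we present the
derivatives of some common functions:

$$\begin{aligned}
\frac{d}{dx} C & = 0 && \textrm{for any constant } C \\
\frac{d}{dx} x^n & = n x^{n-1} && \textrm{for } n \neq 0 \\
\frac{d}{dx} e^x & = e^x
\end{aligned} \tag{2.4.3}$$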
Functions composed from differentiable functions are often themselves differentiable. The
following rules come in handy for working with compositions of any differentiable func-
tions $f$ and $g$, and constant $C$.

$$\begin{aligned}
\frac{d}{dx} [C f(x)] & = C \frac{d}{dx} f(x) && \textrm{Constant multiple rule} \\
\frac{d}{dx} [f(x) + g(x)] & = \frac{d}{dx} f(x) + \frac{d}{dx} g(x) && \textrm{Sum rule} \\
\frac{d}{dx} [f(x) g(x)] & = f(x) \frac{d}{dx} g(x) + g(x) \frac{d}{dx} f(x) && \textrm{Product rule} \\
\frac{d}{dx} \frac{f(x)}{g(x)} & = \frac{g(x) \frac{d}{dx} f(x) - f(x) \frac{d}{dx} g(x)}{g^2(x)} && \textrm{Quotient rule}
\end{aligned} \tag{2.4.4}$$

Using this, we can apply the rules to find the derivative of $3 x^2 - 4x$ via

$$\frac{d}{dx} [3 x^2 - 4x] = 3 \frac{d}{dx} x^2 - 4 \frac{d}{dx} x = 6x - 4. \tag{2.4.5}$$


Plugging in x = 1 shows that, indeed, the derivative equals 2 at this location. Note that 
derivatives tell us the slope of a function at a particular location. 
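
The analytic result $6x - 4$ can be compared against a numerical estimate at several points. Note that the sketch below uses a central difference, $(f(x+h) - f(x-h)) / (2h)$, which is a different (and typically more accurate) approximation than the one-sided quotient in the definition (2.4.1); the helper names are our own.

```python
def f(x):
    return 3 * x ** 2 - 4 * x


def f_prime(x):
    """Analytic derivative obtained from the rules above."""
    return 6 * x - 4


def central_difference(fn, x, h=1e-5):
    """Numerical slope estimate using points on both sides of x."""
    return (fn(x + h) - fn(x - h)) / (2 * h)


for x in (0.0, 1.0, 2.5):
    print(f"x={x}, analytic={f_prime(x):.5f}, numerical={central_difference(f, x):.5f}")
```

For a quadratic such as this one, the central difference is exact up to floating-point rounding, so the two columns agree to the printed precision.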


2.4.2 Visualization Utilities 


We can visualize the slopes of functions using the matplotlib library. We need to de- 
fine a few functions. As its name indicates, use_svg_display tells matplotlib to output 
graphics in SVG format for crisper images. The comment #@save is a special modifier that 
allows us to save any function, class, or other code block to the d2l package so that we can
invoke it later without repeating the code, e.g., via d2l.use_svg_display().


def use_svg_display():  #@save
    """Use the svg format to display a plot in Jupyter."""
    backend_inline.set_matplotlib_formats('svg')


Conveniently, we can set figure sizes with set_figsize. Since the import statement from
matplotlib import pyplot as plt was marked via #@save in the d2l package, we can
call d2l.plt.


def set_figsize(figsize=(3.5, 2.5)):  #@save
    """Set the figure size for matplotlib."""
    use_svg_display()
    d2l.plt.rcParams['figure.figsize'] = figsize


The set_axes function can associate axes with properties, including labels, ranges, and 
scales. 


#@save
def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
    """Set the axes for matplotlib."""
    axes.set_xlabel(xlabel), axes.set_ylabel(ylabel)
    axes.set_xscale(xscale), axes.set_yscale(yscale)
    axes.set_xlim(xlim), axes.set_ylim(ylim)
    if legend:
        axes.legend(legend)
    axes.grid()


With these three functions, we can define a plot function to overlay multiple curves. Much 
of the code here is just ensuring that the sizes and shapes of inputs match. 


#@save
def plot(X, Y=None, xlabel=None, ylabel=None, legend=[], xlim=None,
         ylim=None, xscale='linear', yscale='linear',
         fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):
    """Plot data points."""

    def has_one_axis(X):  # True if X (tensor or list) has 1 axis
        return (hasattr(X, "ndim") and X.ndim == 1 or isinstance(X, list)
                and not hasattr(X[0], "__len__"))

    if has_one_axis(X): X = [X]
    if Y is None:
        X, Y = [[]] * len(X), X
    elif has_one_axis(Y):
        Y = [Y]
    if len(X) != len(Y):
        X = X * len(Y)

    set_figsize(figsize)
    if axes is None:
        axes = d2l.plt.gca()
    axes.cla()
    for x, y, fmt in zip(X, Y, fmts):
        axes.plot(x, y, fmt) if len(x) else axes.plot(y, fmt)
    set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)


Now we can plot the function u = f(x) and its tangent line y = 2x - 3 at x = 1, where the
coefficient 2 is the slope of the tangent line.


x = np.arange(0, 3, 0.1)
plot(x, [f(x), 2 * x - 3], 'x', 'f(x)', legend=['f(x)', 'Tangent line (x=1)'])


[Figure: f(x) (solid line) and its tangent line at x=1 (dashed line).]


2.4.3 Partial Derivatives and Gradients


Thus far, we have been differentiating functions of just one variable. In deep learning, we 
also need to work with functions of many variables. We briefly introduce notions of the 
derivative that apply to such multivariate functions. 


Let $y = f(x_1, x_2, \ldots, x_n)$ be a function with $n$ variables. The partial derivative of $y$ with
respect to its $i$-th parameter $x_i$ is

$$\frac{\partial y}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}. \tag{2.4.6}$$

To calculate $\frac{\partial y}{\partial x_i}$, we can treat $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$ as constants and calculate the
derivative of $y$ with respect to $x_i$. The following notational conventions for partial derivatives are
all common and all mean the same thing:

$$\frac{\partial y}{\partial x_i} = \frac{\partial f}{\partial x_i} = \partial_{x_i} f = \partial_i f = f_{x_i} = f_i = D_i f = D_{x_i} f. \tag{2.4.7}$$

We can concatenate partial derivatives of a multivariate function with respect to all its
variables to obtain a vector that is called the gradient of the function. Suppose that the
input of function $f : \mathbb{R}^n \to \mathbb{R}$ is an $n$-dimensional vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^\top$ and the
output is a scalar. The gradient of the function $f$ with respect to $\mathbf{x}$ is a vector of $n$ partial
derivatives:


$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \left[\partial_{x_1} f(\mathbf{x}), \partial_{x_2} f(\mathbf{x}), \ldots, \partial_{x_n} f(\mathbf{x})\right]^\top. \tag{2.4.8}$$


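When there is no ambiguity, $\nabla_{\mathbf{x}} f(\mathbf{x})$ is typically replaced by $\nabla f(\mathbf{x})$. The following rules
come in handy for differentiating multivariate functions:


• For all $\mathbf{A} \in \mathbb{R}^{m \times n}$ we have $\nabla_{\mathbf{x}} \mathbf{A}\mathbf{x} = \mathbf{A}^\top$, and for all $\mathbf{A} \in \mathbb{R}^{n \times m}$ we have $\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} = \mathbf{A}$.


• For square matrices $\mathbf{A} \in \mathbb{R}^{n \times n}$ we have that $\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} \mathbf{x} = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$ and in particular
$\nabla_{\mathbf{x}} \|\mathbf{x}\|^2 = \nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{x} = 2\mathbf{x}$.


Similarly, for any matrix $\mathbf{X}$, we have $\nabla_{\mathbf{X}} \|\mathbf{X}\|_F^2 = 2\mathbf{X}$.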
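

These identities can be spot-checked numerically. The sketch below is an illustration rather than part of the original text: it uses PyTorch's automatic differentiation (the subject of Section 2.5) on randomly drawn inputs, and the sizes and the seed are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
n = 4
A = torch.randn(n, n, dtype=torch.float64)
x = torch.randn(n, dtype=torch.float64, requires_grad=True)

# Gradient of the quadratic form x^T A x should be (A + A^T) x.
(x @ A @ x).backward()
quad_ok = torch.allclose(x.grad, (A + A.T) @ x.detach())

# Gradient of ||x||^2 = x^T x should be 2x.
x.grad.zero_()
(x @ x).backward()
norm_ok = torch.allclose(x.grad, 2 * x.detach())

# Gradient of the squared Frobenius norm of a matrix X should be 2X.
X = torch.randn(3, 5, dtype=torch.float64, requires_grad=True)
(X ** 2).sum().backward()
frob_ok = torch.allclose(X.grad, 2 * X.detach())

print(quad_ok, norm_ok, frob_ok)
```

A check on random inputs is not a proof, but a mismatch would quickly reveal a sign or transpose error.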


2.4.4 Chain Rule 


In deep learning, the gradients of concern are often difficult to calculate because we are
working with deeply nested functions (of functions (of functions...)). Fortunately, the chain
rule takes care of this. Returning to functions of a single variable, suppose that $y = f(g(x))$
and that the underlying functions $y = f(u)$ and $u = g(x)$ are both differentiable. The chain
rule states that

$$\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}.$$

Turning back to multivariate functions, suppose that $y = f(\mathbf{u})$ has variables $u_1, u_2, \ldots, u_m$,
where each $u_i = g_i(\mathbf{x})$ has variables $x_1, x_2, \ldots, x_n$, i.e., $\mathbf{u} = g(\mathbf{x})$. Then the chain rule
states that

$$\frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial u_1} \frac{\partial u_1}{\partial x_i} + \frac{\partial y}{\partial u_2} \frac{\partial u_2}{\partial x_i} + \ldots + \frac{\partial y}{\partial u_m} \frac{\partial u_m}{\partial x_i} \quad \text{and so} \quad \nabla_{\mathbf{x}} y = \mathbf{A} \nabla_{\mathbf{u}} y, \tag{2.4.9}$$

where $\mathbf{A} \in \mathbb{R}^{n \times m}$ is a matrix that contains the derivative of vector $\mathbf{u}$ with respect to vector
$\mathbf{x}$. Thus, evaluating the gradient requires computing a vector-matrix product. This is one
of the key reasons why linear algebra is such an integral building block in building deep
learning systems.
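

To make the matrix form of the chain rule concrete, here is a small sketch under stated assumptions: the functions `g` and `f` below are made up for illustration, and `torch.autograd.functional.jacobian` is used to obtain the derivatives. The matrix A is the transpose of the Jacobian of u with respect to x, so that A has shape n × m.

```python
import torch
from torch.autograd.functional import jacobian


def g(x):
    """Illustrative inner function u = g(x), mapping R^3 to R^2."""
    return torch.stack([x[0] * x[1], torch.sin(x[2]) + x[0]])


def f(u):
    """Illustrative outer function y = f(u), mapping R^2 to R."""
    return u[0] ** 2 + 3 * u[1]


x = torch.tensor([1.0, 2.0, 0.5], dtype=torch.float64)
u = g(x)

J = jacobian(g, x)        # shape (m, n) = (2, 3); entry [j, i] is du_j / dx_i
A = J.T                   # shape (n, m), the matrix A in the text
grad_u = jacobian(f, u)   # gradient of y with respect to u, shape (m,)

grad_x_chain = A @ grad_u                          # chain rule
grad_x_direct = jacobian(lambda t: f(g(t)), x)     # differentiate the composition

print(grad_x_chain)
print(grad_x_direct)
```

Both printed vectors should coincide, which is exactly the statement that the gradient with respect to x equals A times the gradient with respect to u.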


2.4.5 Discussion 


While we have just scratched the surface of a deep topic, a number of concepts already come 
into focus: first, the composition rules for differentiation can be applied routinely, enabling 
us to compute gradients automatically. This task requires no creativity and thus we can 
focus our cognitive powers elsewhere. Second, computing the derivatives of vector-valued 
functions requires us to multiply matrices as we trace the dependency graph of variables 
from output to input. In particular, this graph is traversed in a forward direction when 
we evaluate a function and in a backwards direction when we compute gradients. Later 
chapters will formally introduce backpropagation, a computational procedure for applying 
the chain rule. 


From the viewpoint of optimization, gradients allow us to determine how to move the pa- 
rameters of a model in order to lower the loss, and each step of the optimization algorithms 
used throughout this book will require calculating the gradient. 


2.4.6 Exercises 


1. So far we took the rules for derivatives for granted. Using the definition and limits prove
the properties for (i) $f(x) = c$, (ii) $f(x) = x^n$, (iii) $f(x) = e^x$ and (iv) $f(x) = \log x$.


2. In the same vein, prove the product, sum, and quotient rule from first principles.


3. Prove that the constant multiple rule follows as a special case of the product rule.


4. Calculate the derivative of $f(x) = x^x$.


5. What does it mean that $f'(x) = 0$ for some $x$? Give an example of a function $f$ and a
location $x$ for which this might hold.


6. Plot the function $y = f(x) = x^3 - \frac{1}{x}$ and plot its tangent line at $x = 1$.


7. Find the gradient of the function $f(\mathbf{x}) = 3x_1^2 + 5e^{x_2}$.


8. What is the gradient of the function $f(\mathbf{x}) = \|\mathbf{x}\|_2$? What happens for $\mathbf{x} = \mathbf{0}$?


9. Can you write out the chain rule for the case where $u = f(x, y, z)$ and $x = x(a, b)$,
$y = y(a, b)$, and $z = z(a, b)$?


10. Given a function $f(x)$ that is invertible, compute the derivative of its inverse $f^{-1}(x)$.
Here we have that $f^{-1}(f(x)) = x$ and conversely $f(f^{-1}(y)) = y$. Hint: use these
properties in your derivation.


Discussions.


2.5 Automatic Differentiation


Recall from Section 2.4 that calculating derivatives is the crucial step in all the optimization 
algorithms that we will use to train deep networks. While the calculations are straightfor- 
ward, working them out by hand can be tedious and error-prone, and these issues only grow 
as our models become more complex. 


Fortunately all modern deep learning frameworks take this work off our plates by offering 
automatic differentiation (often shortened to autograd). As we pass data through each 
successive function, the framework builds a computational graph that tracks how each value 
depends on others. To calculate derivatives, automatic differentiation works backwards 
through this graph applying the chain rule. The computational algorithm for applying the 
chain rule in this fashion is called backpropagation. 


While autograd libraries have become a hot concern over the past decade, they have a 
long history. In fact the earliest references to autograd date back over half of a century 
(Wengert, 1964). The core ideas behind modern backpropagation date to a PhD thesis 
from 1980 (Speelpenning, 1980) and were further developed in the late 1980s (Griewank, 
1989). While backpropagation has become the default method for computing gradients, 
it is not the only option. For instance, the Julia programming language employs forward 
propagation (Revels et al., 2016). Before exploring methods, let’s first master the autograd 
package. 


import torch 


2.5.1 A Simple Function 


Let’s assume that we are interested in differentiating the function $y = 2\mathbf{x}^\top \mathbf{x}$ with respect to
the column vector $\mathbf{x}$. To start, we assign x an initial value.


x = torch.arange(4.0)
x


tensor([0., 1., 2., 3.])


Before we calculate the gradient of y with respect to x, we need a place to store it. In 
general, we avoid allocating new memory every time we take a derivative because deep 
learning requires successively computing derivatives with respect to the same parameters 
a great many times, and we might risk running out of memory. Note that the gradient of 
a scalar-valued function with respect to a vector x is vector-valued with the same shape as
x.


# Can also create x = torch.arange(4.0, requires_grad=True)
x.requires_grad_(True)
x.grad  # The gradient is None by default


We now calculate our function of x and assign the result to y. 


y = 2 * torch.dot(x, x)
y


tensor(28., grad_fn=<MulBackward0>)


We can now take the gradient of y with respect to x by calling its backward method. Next, 
we can access the gradient via x’s grad attribute. 


y.backward()
x.grad


tensor([ 0.,  4.,  8., 12.])


We already know that the gradient of the function $y = 2\mathbf{x}^\top \mathbf{x}$ with respect to x should be
4x. We can now verify that the automatic gradient computation and the expected result are
identical.


x.grad == 4 * x 


tensor([True, True, True, True])


Now let’s calculate another function of x and take its gradient. Note that PyTorch does not 
automatically reset the gradient buffer when we record a new gradient. Instead, the new 
gradient is added to the already-stored gradient. This behavior comes in handy when we 
want to optimize the sum of multiple objective functions. To reset the gradient buffer, we 
can call x.grad.zero_() as follows: 


x.grad.zero_()  # Reset the gradient
y = x.sum()
y.backward()
x.grad


tensor([1., 1., 1., 1.]) 
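

To see the accumulation behavior described above in isolation, the following sketch (an illustration, with variable names chosen here) runs two backward passes without resetting the buffer and then repeats the second pass after calling `x.grad.zero_()`.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# First backward pass: gradient of 2 * x^T x is 4x.
(2 * torch.dot(x, x)).backward()
first = x.grad.clone()

# Second backward pass without a reset: the gradient of x.sum() (all ones)
# is added to what is already stored, giving 4x + 1.
x.sum().backward()
accumulated = x.grad.clone()

# After resetting the buffer, the same pass yields just the ones.
x.grad.zero_()
x.sum().backward()
after_reset = x.grad.clone()

print(first, accumulated, after_reset)
```

The middle result is the sum of both gradients, which is what one wants when optimizing a sum of objectives and what one must avoid when the computations are unrelated.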


2.5.2 Backward for Non-Scalar Variables 


When y is a vector, the most natural representation of the derivative of y with respect
to a vector x is a matrix called the Jacobian that contains the partial derivatives of each
component of y with respect to each component of x. Likewise, for higher-order y and x,
the result of differentiation could be an even higher-order tensor.


While Jacobians do show up in some advanced machine learning techniques, more com- 
monly we want to sum up the gradients of each component of y with respect to the full 
vector x, yielding a vector of the same shape as x. For example, we often have a vector 
representing the value of our loss function calculated separately for each example among a 
batch of training examples. Here, we just want to sum up the gradients computed individ- 
ually for each example. 


Because deep learning frameworks vary in how they interpret gradients of non-scalar
tensors, PyTorch takes some steps to avoid confusion. Invoking backward on a non-scalar
elicits an error unless we tell PyTorch how to reduce the object to a scalar. More formally,
we need to provide some vector $\mathbf{v}$ such that backward will compute $\mathbf{v}^\top \partial_{\mathbf{x}} \mathbf{y}$ rather than
$\partial_{\mathbf{x}} \mathbf{y}$. This next part may be confusing, but for reasons that will become clear later, this
argument (representing $\mathbf{v}$) is named gradient. For a more detailed description, see Yang
Zhang’s Medium post.


x.grad.zero_()
y = x * x
y.backward(gradient=torch.ones(len(y)))  # Faster: y.sum().backward()
x.grad


tensor([0., 2., 4., 6.])
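

The role of the gradient argument becomes clearer with a vector other than all ones. The sketch below is illustrative: the weights in `v` are arbitrary, and `torch.autograd.functional.jacobian` is used only to build the full Jacobian for comparison.

```python
import torch
from torch.autograd.functional import jacobian

x = torch.arange(4.0, requires_grad=True)
v = torch.tensor([1.0, 0.5, 2.0, -1.0])  # arbitrary illustrative weights

y = x * x
y.backward(gradient=v)  # stores v^T (dy/dx) in x.grad

# Full Jacobian of the elementwise square; here it is diag(2x).
J = jacobian(lambda t: t * t, x.detach())
expected = v @ J  # the vector-Jacobian product v^T J

print(x.grad)
print(expected)
```

Passing a vector of ones, as in the example above, therefore sums the rows of the Jacobian, which matches calling `y.sum().backward()`.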


2.5.3 Detaching Computation 


Sometimes, we wish to move some calculations outside of the recorded computational
graph. For example, say that we use the input to create some auxiliary intermediate terms
for which we do not want to compute a gradient. In this case, we need to detach the
respective computational graph from the final result. The following toy example makes this
clearer: suppose we have z = x * y and y = x * x but we want to focus on the direct
influence of x on z rather than the influence conveyed via y. In this case, we can create a
new variable u that takes the same value as y but whose provenance (how it was created)
has been wiped out. Thus u has no ancestors in the graph and gradients do not flow through
u to x. For example, taking the gradient of z = x * u will yield the result u (not 3 * x * x
as you might have expected since z = x * x * x).


x.grad.zero_()
y = x * x
u = y.detach()
z = u * x

z.sum().backward()
x.grad == u


tensor([True, True, True, True])


Note that while this procedure detaches y’s ancestors from the graph leading to z, the com- 
putational graph leading to y persists and thus we can calculate the gradient of y with 
respect to x. 


x.grad.zero_()
y.sum().backward()
x.grad == 2 * x


tensor([True, True, True, True]) 


2.5.4 Gradients and Python Control Flow 


So far we reviewed cases where the path from input to output was well defined via a func- 
tion such as z = x * x * x. Programming offers us a lot more freedom in how we 
compute results. For instance, we can make them depend on auxiliary variables or condi- 
tion choices on intermediate results. One benefit of using automatic differentiation is that 
even if building the computational graph of a function required passing through a maze 
of Python control flow (e.g., conditionals, loops, and arbitrary function calls), we can still 
calculate the gradient of the resulting variable. To illustrate this, consider the following 
code snippet where the number of iterations of the while loop and the evaluation of the if 
statement both depend on the value of the input a. 


def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c


Below, we call this function, passing in a random value, as input. Since the input is a 
random variable, we do not know what form the computational graph will take. However, 
whenever we execute f(a) on a specific input, we realize a specific computational graph 
and can subsequently run backward. 


a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()


Even though our function f is, for demonstration purposes, a bit contrived, its dependence
on the input is quite simple: it is a linear function of a with piecewise defined scale. As
such, f(a) / a is a vector of constant entries and, moreover, f(a) / a needs to match the
gradient of f(a) with respect to a.


a.grad == d / a


tensor(True)


Dynamic control flow is very common in deep learning. For instance, when processing 
text, the computational graph depends on the length of the input. In these cases, automatic 
differentiation becomes vital for statistical modeling since it is impossible to compute the 
gradient a priori. 
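

To illustrate that each input realizes its own computational graph, the sketch below reruns the function f from this section on a few fixed inputs. The input values are arbitrary choices made for this illustration; f is repeated so that the block is self-contained.

```python
import torch


def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c


results = []
for value in [0.3, 7.0, -2.5]:  # arbitrary illustrative inputs
    a = torch.tensor(value, requires_grad=True)
    d = f(a)          # the loop count and the branch taken depend on `value`
    d.backward()
    results.append((value, a.grad.item(), (d / a).item()))

for value, grad, scale in results:
    print(f"a={value:5.1f}  grad={grad:9.1f}  f(a)/a={scale:9.1f}")
```

For every input the gradient agrees with f(a) / a, yet the scale itself differs from input to input because the number of loop iterations and the branch taken differ.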


2.5.5 Discussion 


You have now gotten a taste of the power of automatic differentiation. The development of 
libraries for calculating derivatives both automatically and efficiently has been a massive 
productivity booster for deep learning practitioners, liberating them so they can focus on
less menial tasks. Moreover, autograd lets us design massive models for which pen and paper
gradient computations would be prohibitively time consuming. Interestingly, while we use 
autograd to optimize models (in a statistical sense) the optimization of autograd libraries 
themselves (in a computational sense) is a rich subject of vital interest to framework design- 
ers. Here, tools from compilers and graph manipulation are leveraged to compute results 
in the most expedient and memory-efficient manner. 


For now, try to remember these basics: (i) attach gradients to those variables with respect 
to which we desire derivatives; (ii) record the computation of the target value; (iii) execute 
the backpropagation function; and (iv) access the resulting gradient. 
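

The four steps can be sketched in a few lines. This is a minimal illustration with a function chosen here for convenience (the sum of exponentials), not a prescribed recipe.

```python
import torch

# (i) Attach a gradient to the variable we differentiate with respect to.
x = torch.tensor([0.0, 1.0, 2.0], requires_grad=True)

# (ii) Record the computation of the (scalar) target value.
y = torch.exp(x).sum()

# (iii) Execute backpropagation.
y.backward()

# (iv) Access the resulting gradient; for this y it should equal exp(x).
grad = x.grad
matches = torch.allclose(grad, torch.exp(x.detach()))
print(grad, matches)
```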


2.5.6 Exercises 
1. Why is the second derivative much more expensive to compute than the first derivative? 


2. After running the function for backpropagation, immediately run it again and see what 
happens. Investigate. 


3. In the control flow example where we calculate the derivative of d with respect to a, 
what would happen if we changed the variable a to a random vector or a matrix? At 
this point, the result of the calculation f(a) is no longer a scalar. What happens to the 
result? How do we analyze this? 


4. Let f(x) = sin(x). Plot the graph of f and of its derivative f’. Do not exploit the fact 
that f'(x) = cos(x) but rather use automatic differentiation to get the result. 
5. Let $f(x) = ((\log x^2) \cdot \sin x) + x^{-1}$. Write out a dependency graph tracing results from
$x$ to $f(x)$.


6. Use the chain rule to compute the derivative $\frac{df}{dx}$ of the aforementioned function, placing
each term on the dependency graph that you constructed previously.


7. Given the graph and the intermediate derivative results, you have a number of options
when computing the gradient. Evaluate the result once starting from x to f and once
from f tracing back to x. The path from x to f is commonly known as forward
differentiation, whereas the path from f to x is known as backward differentiation.


8. When might you want to use forward, and when backward, differentiation? Hint: con- 
sider the amount of intermediate data needed, the ability to parallelize steps, and the 
size of matrices and vectors involved. 


Discussions.


2.6 Probability and Statistics 


One way or another, machine learning is all about uncertainty. In supervised learning, we 
want to predict something unknown (the target) given something known (the features). De- 
pending on our objective, we might attempt to predict the most likely value of the target. 
Or we might predict the value with the smallest expected distance from the target. And 
sometimes we wish not only to predict a specific value but to quantify our uncertainty. For 
example, given some features describing a patient, we might want to know how likely they 
are to suffer a heart attack in the next year. In unsupervised learning, we often care about 
uncertainty. To determine whether a set of measurements are anomalous, it helps to know 
how likely one is to observe values in a population of interest. Furthermore, in reinforce- 
ment learning, we wish to develop agents that act intelligently in various environments. 
This requires reasoning about how an environment might be expected to change and what 
rewards one might expect to encounter in response to each of the available actions. 


Probability is the mathematical field concerned with reasoning under uncertainty. Given a 
probabilistic model of some process, we can reason about the likelihood of various events. 
The use of probabilities to describe the frequencies of repeatable events (like coin tosses) is 
fairly uncontroversial. In fact, frequentist scholars adhere to an interpretation of probability 
that applies only to such repeatable events. By contrast Bayesian scholars use the language 
of probability more broadly to formalize reasoning under uncertainty. Bayesian probability 
is characterized by two unique features: (i) assigning degrees of belief to non-repeatable 
events, e.g., what is the probability that a dam will collapse?; and (ii) subjectivity. While 
Bayesian probability provides unambiguous rules for how one should update their beliefs in 
light of new evidence, it allows for different individuals to start off with different prior be- 
liefs. Statistics helps us to reason backwards, starting off with collection and organization 
of data and backing out to what inferences we might draw about the process that generated 
the data. Whenever we analyze a dataset, hunting for patterns that we hope might charac- 
terize a broader population, we are employing statistical thinking. Many courses, majors, 
theses, careers, departments, companies, and institutions have been devoted to the study of 
probability and statistics. While this section only scratches the surface, we will provide the 
foundation that you need to begin building models.


%matplotlib inline
import random
import torch
from torch.distributions.multinomial import Multinomial
from d2l import torch as d2l


2.6.1 A Simple Example: Tossing Coins 


Imagine that we plan to toss a coin and want to quantify how likely we are to see heads
(vs. tails). If the coin is fair, then both outcomes (heads and tails) are equally likely.
Moreover, if we plan to toss the coin $n$ times, then the fraction of heads that we expect to
see should exactly match the expected fraction of tails. One intuitive way to see this is
by symmetry: for every possible outcome with $n_h$ heads and $n_t = (n - n_h)$ tails, there is
an equally likely outcome with $n_t$ heads and $n_h$ tails. Note that this is only possible if on
average we expect to see 1/2 of tosses come up heads and 1/2 come up tails. Of course, if
you conduct this experiment many times with $n = 1000000$ tosses each, you might never
see a trial where $n_h = n_t$ exactly.


Formally, the quantity 1/2 is called a probability and here it captures the certainty with
which any given toss will come up heads. Probabilities assign scores between 0 and 1 to
outcomes of interest, called events. Here the event of interest is heads and we denote the
corresponding probability P(heads). A probability of 1 indicates absolute certainty
(imagine a trick coin where both sides were heads) and a probability of 0 indicates impossibility
(e.g., if both sides were tails). The frequencies $n_h/n$ and $n_t/n$ are not probabilities but rather
statistics. Probabilities are theoretical quantities that underlie the data generating process.
Here, the probability 1/2 is a property of the coin itself. By contrast, statistics are empirical
quantities that are computed as functions of the observed data. Our interests in probabilistic
and statistical quantities are inextricably intertwined. We often design special statistics
called estimators that, given a dataset, produce estimates of model parameters such as
probabilities. Moreover, when those estimators satisfy a nice property called consistency, our
estimates will converge to the corresponding probability. In turn, these inferred probabilities
tell us about the likely statistical properties of data from the same population that we might
encounter in the future.


Suppose that we stumbled upon a real coin for which we did not know the true P(heads). 
To investigate this quantity with statistical methods, we need to (i) collect some data; and 
(ii) design an estimator. Data acquisition here is easy; we can toss the coin many times 
and record all the outcomes. Formally, drawing realizations from some underlying random 
process is called sampling. As you might have guessed, one natural estimator is the ratio 
of the number of observed heads to the total number of tosses. 


Now, suppose that the coin was in fact fair, i.e., P(heads) = 0.5. To simulate tosses of a
fair coin, we can invoke any random number generator. There are some easy ways to draw
samples of an event with probability 0.5. For example, Python’s random.random yields
numbers in the interval $[0, 1)$ where the probability of lying in any sub-interval $[a, b] \subset [0, 1)$
is equal to $b - a$. Thus we can get out 0 and 1 with probability 0.5 each by testing
whether the returned float number is greater than 0.5:


num_tosses = 100
heads = sum([random.random() > 0.5 for _ in range(num_tosses)])
tails = num_tosses - heads
print("heads, tails: ", [heads, tails])


heads, tails: [44, 56] 


More generally, we can simulate multiple draws from any variable with a finite number
of possible outcomes (like the toss of a coin or roll of a die) by calling the multinomial
function, setting the first argument to the number of draws and the second as a list of
probabilities associated with each of the possible outcomes. To simulate 100 tosses of a fair coin,
we assign probability vector [0.5, 0.5], interpreting index 0 as heads and index 1 as tails.
The function returns a vector with length equal to the number of possible outcomes (here,
2), where the first component tells us the number of occurrences of heads and the second
component tells us the number of occurrences of tails.


fair_probs = torch.tensor([0.5, 0.5])
Multinomial(100, fair_probs).sample()


tensor([50., 50.])


Each time you run this sampling process, you will receive a new random value that may 
differ from the previous outcome. Dividing by the number of tosses gives us the frequency 
of each outcome in our data. Note that these frequencies, just like the probabilities that they 
are intended to estimate, sum to 1. 


Multinomial(100, fair_probs).sample() / 100 


tensor([0.4800, 0.5200])


Here, even though our simulated coin is fair (we ourselves set the probabilities [0.5, 0.5]),
the counts of heads and tails may not be identical. That is because we only drew a
relatively small number of samples. If we did not implement the simulation ourselves, and
only saw the outcome, how would we know if the coin were slightly unfair or if the possible
deviation from 1/2 was just an artifact of the small sample size? Let’s see what happens
when we simulate 10,000 tosses.


counts = Multinomial(10000, fair_probs).sample()
counts / 10000


tensor([0.4966, 0.5034])


In general, for averages of repeated events (like coin tosses), as the number of repetitions
grows, our estimates are guaranteed to converge to the true underlying probabilities. The
mathematical formulation of this phenomenon is called the law of large numbers and the
central limit theorem tells us that in many situations, as the sample size $n$ grows, these
errors should go down at a rate of $(1/\sqrt{n})$. Let’s get some more intuition by studying how
our estimate evolves as we grow the number of tosses from 1 to 10,000.


counts = Multinomial(1, fair_probs).sample((10000,))
cum_counts = counts.cumsum(dim=0)
estimates = cum_counts / cum_counts.sum(dim=1, keepdims=True)
estimates = estimates.numpy()

d2l.set_figsize((4.5, 3.5))
d2l.plt.plot(estimates[:, 0], label=("P(coin=heads)"))
d2l.plt.plot(estimates[:, 1], label=("P(coin=tails)"))
d2l.plt.axhline(y=0.5, color='black', linestyle='dashed')
d2l.plt.gca().set_xlabel('Samples')
d2l.plt.gca().set_ylabel('Estimated probability')
d2l.plt.legend();


[Figure: estimated probability of P(coin=heads) and P(coin=tails) against the number of samples, from 0 to 10000.]


Each solid curve corresponds to one of the two values of the coin and gives our estimated 
probability that the coin turns up that value after each group of experiments. The dashed 
black line gives the true underlying probability. As we get more data by conducting more 
experiments, the curves converge towards the true probability. You might already begin to 
see the shape of some of the more advanced questions that preoccupy statisticians: How 
quickly does this convergence happen? If we had already tested many coins manufactured 
at the same plant, how might we incorporate this information? 
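

One rough way to probe the first question is to measure the typical estimation error at two sample sizes. The sketch below is an illustration under stated assumptions: the helper `mean_abs_error`, the number of repetitions, and the seed are choices made here. If the error shrinks like 1/√n, increasing n by a factor of 100 should reduce it by roughly a factor of 10.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the sketch is reproducible


def mean_abs_error(n, repeats=200):
    """Average |estimate - 0.5| over repeated experiments of n fair tosses."""
    errors = []
    for _ in range(repeats):
        heads = sum(random.random() > 0.5 for _ in range(n))
        errors.append(abs(heads / n - 0.5))
    return statistics.mean(errors)


err_small = mean_abs_error(100)
err_large = mean_abs_error(10_000)
ratio = err_small / err_large

print(f"n=100:   mean abs error {err_small:.4f}")
print(f"n=10000: mean abs error {err_large:.4f}")
print(f"ratio:   {ratio:.1f}")
```

The ratio fluctuates from run to run because it is itself estimated from a finite number of repetitions, so it should be read as an order-of-magnitude check rather than a precise measurement.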


2.6.2 A More Formal Treatment 


We have already gotten pretty far: posing a probabilistic model, generating synthetic data,
running a statistical estimator, empirically assessing convergence, and reporting error
metrics (checking the deviation). However, to go much further, we will need to be more
precise.


When dealing with randomness, we denote the set of possible outcomes S and call it the
sample space or outcome space. Here, each element is a distinct possible outcome. In
the case of tossing a single coin, S = {heads, tails}. For a single die, S = {1, 2, 3, 4, 5, 6}.
When flipping two coins, possible outcomes are {(heads, heads), (heads, tails), (tails, heads),
(tails, tails)}.


Events are subsets of the sample space. For instance, the event “the first coin toss comes
up heads” corresponds to the set {(heads, heads), (heads, tails)}. Whenever the outcome
z of a random experiment satisfies z ∈ A, then event A has occurred. For a single roll
of a die, we could define the events “seeing a 5” (A = {5}) and “seeing an odd number”
(B = {1, 3, 5}). In this case, if the die came up 5, we would say that both A and B occurred.
On the other hand, if z = 3, then A did not occur but B did.


A probability function maps events onto real values $P : A \subseteq S \to [0, 1]$. The
probability, denoted $P(A)$, of an event $A$ in the given sample space $S$, has the following
properties:


• The probability of any event $A$ is a nonnegative real number, i.e., $P(A) \geq 0$;


• The probability of the entire sample space is 1, i.e., $P(S) = 1$;


• For any countable sequence of events $A_1, A_2, \ldots$ that are mutually exclusive (i.e.,
$A_i \cap A_j = \emptyset$ for all $i \neq j$), the probability that any of them happens is equal to the sum of
their individual probabilities, i.e., $P(\bigcup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} P(A_i)$.


These axioms of probability theory, proposed by Kolmogorov (1933), can be applied to
rapidly derive a number of important consequences. For instance, it follows immediately
that the probability of any event $A$ or its complement $A'$ occurring is 1 (because $A \cup A' = S$).
We can also prove that $P(\emptyset) = 0$ because $1 = P(S \cup S') = P(S \cup \emptyset) = P(S) + P(\emptyset) =
1 + P(\emptyset)$. Consequently, the probability of any event $A$ and its complement $A'$ occurring
simultaneously is $P(A \cap A') = 0$. Informally, this tells us that impossible events have zero
probability of occurring.
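

As a concrete check, the sketch below verifies these properties for a fair six-sided die. The uniform probability function `P` and the event `C` are assumptions introduced for this illustration; `A` and `B` are the events from the text.

```python
from fractions import Fraction

sample_space = frozenset(range(1, 7))  # S = {1, ..., 6}


def P(event):
    """Uniform probability on a fair die: |event| / |S| (illustrative)."""
    event = frozenset(event)
    if not event <= sample_space:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))


A = {5}          # "seeing a 5"
B = {1, 3, 5}    # "seeing an odd number"
C = {2, 4}       # hypothetical extra event, disjoint from A

total_is_one = P(sample_space) == 1
empty_is_zero = P(set()) == 0
additive_when_disjoint = P(A | C) == P(A) + P(C)
complement_rule = P(B) + P(sample_space - B) == 1
# A and B overlap, so their probabilities do not simply add up.
not_additive_when_overlapping = P(A | B) != P(A) + P(B)

print(total_is_one, empty_is_zero, additive_when_disjoint,
      complement_rule, not_additive_when_overlapping)
```

The last check is a reminder that the additivity axiom applies only to mutually exclusive events.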


2.6.3 Random Variables 


When we spoke about events like the roll of a die coming up odd or the first coin toss
coming up heads, we were invoking the idea of a random variable. Formally, random
variables are mappings from an underlying sample space to a set of (possibly many) values.
You might wonder how a random variable is different from the sample space, since both are
collections of outcomes. Importantly, random variables can be much coarser than the raw
sample space. We can define a binary random variable like “greater than 0.5” even when
the underlying sample space is infinite, e.g., points on the line segment between 0 and 1.
Additionally, multiple random variables can share the same underlying sample space. For
example “whether my home alarm goes off” and “whether my house was burgled” are both
binary random variables that share an underlying sample space. Consequently, knowing the
value taken by one random variable can tell us something about the likely value of another
random variable. Knowing that the alarm went off, we might suspect that the house was
likely burgled.


Every value taken by a random variable corresponds to a subset of the underlying sample 
space. Thus the occurrence where the random variable X takes value v, denoted by X = v, 
is an event and P(X = v) denotes its probability. Sometimes this notation can get clunky, 
and we can abuse notation when the context is clear. For example, we might use P(X) to 
refer broadly to the distribution of X, i.e., the function that tells us the probability that X 
takes any given value. Other times we write expressions like P(X, Y) = P(X)P(Y), as a
shorthand to express a statement that is true for all of the values that the random variables
X and Y can take, i.e., for all i, j it holds that P(X = i and Y = j) = P(X = i)P(Y = j).
Other times, we abuse notation by writing P(v) when the random variable is clear from the
context. Since an event in probability theory is a set of outcomes from the sample space,
we can specify a range of values for a random variable to take. For example, P(1 < X < 3)
denotes the probability of the event {1 < X < 3}.


Note that there is a subtle difference between discrete random variables, like flips of a coin 
or tosses of a die, and continuous ones, like the weight and the height of a person sampled 
at random from the population. In this case we seldom really care about someone’s exact 
height. Moreover, if we took precise enough measurements, we would find that no two 
people on the planet have the exact same height. In fact, with fine enough measurements, 
you would never have the same height when you wake up and when you go to sleep. There 
is little point in asking about the exact probability that someone is 1.801392782910287192
meters tall. Instead, we typically care more about being able to say whether someone’s 
height falls into a given interval, say between 1.79 and 1.81 meters. In these cases we work 
with probability densities. The height of exactly 1.80 meters has no probability, but nonzero 
density. To work out the probability assigned to an interval, we must take an integral of the 
density over that interval. 
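
To make the distinction between density and probability concrete, here is a minimal sketch. It assumes, purely for illustration, that heights follow a Gaussian with mean 1.75 m and standard deviation 0.07 m; these parameters and the helper names `density`, `cdf`, and `interval_probability` are not from the text. It evaluates the density at 1.80 m and integrates the density over [1.79, 1.81] numerically, comparing the result with the closed form based on the error function.

```python
import math

MEAN, STD = 1.75, 0.07  # hypothetical height distribution (meters)

def density(x):
    """Gaussian probability density at x."""
    z = (x - MEAN) / STD
    return math.exp(-0.5 * z * z) / (STD * math.sqrt(2 * math.pi))

def cdf(x):
    """Gaussian cumulative distribution function, via the error function."""
    return 0.5 * (1 + math.erf((x - MEAN) / (STD * math.sqrt(2))))

def interval_probability(lo, hi, steps=10_000):
    """Approximate the integral of the density over [lo, hi] (midpoint rule)."""
    width = (hi - lo) / steps
    return sum(density(lo + (k + 0.5) * width) for k in range(steps)) * width

p_numeric = interval_probability(1.79, 1.81)
p_exact = cdf(1.81) - cdf(1.79)
print(f"density at 1.80 m: {density(1.80):.3f}")
print(f"P(1.79 <= height <= 1.81): {p_numeric:.4f} (closed form: {p_exact:.4f})")
```

Under these assumed parameters the density at 1.80 m is larger than 1, which is fine for a density, while the probability of the interval stays between 0 and 1.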


2.6.4 Multiple Random Variables 


You might have noticed that we could not even make it through the previous section without 
making statements involving interactions among multiple random variables (recall P(X, Y) = 
P(X)P(Y)). Most of machine learning is concerned with such relationships. Here, the sam- 
ple space would be the population of interest, say customers who transact with a business, 
photographs on the Internet, or proteins known to biologists. Each random variable would 
represent the (unknown) value of a different attribute. Whenever we sample an individual 
from the population, we observe a realization of each of the random variables. Because 
the values taken by random variables correspond to subsets of the sample space that could 
be overlapping, partially overlapping, or entirely disjoint, knowing the value taken by one 
random variable can cause us to update our beliefs about which values of another random 
variable are likely. If a patient walks into a hospital and we observe that they are having 
trouble breathing and have lost their sense of smell, then we believe that they are more 
likely to have COVID-19 than we might if they had no trouble breathing and a perfectly 
ordinary sense of smell. 


When working with multiple random variables, we can construct events corresponding to 
every combination of values that the variables can jointly take. The probability function
that assigns probabilities to each of these combinations (e.g., A = a and B = b) is called the
joint probability function and simply returns the probability assigned to the intersection 
of the corresponding subsets of the sample space. The joint probability assigned to the 
event where random variables A and B take values a and b, respectively, is denoted P(A = 
a, B = b), where the comma indicates “and”. Note that for any values a and b, it follows 
that 


$$P(A = a, B = b) \leq P(A = a) \textrm{ and } P(A = a, B = b) \leq P(B = b), \tag{2.6.1}$$


since for A = a and B = b to happen, A = a has to happen and B = b also has to 
happen. Interestingly, the joint probability tells us all that we can know about these random 
variables in a probabilistic sense, and can be used to derive many other useful quantities, 
including recovering the individual distributions P(A) and P(B). To recover P(A = a) 
we simply sum up P(A = a, B = v) over all values v that the random variable B can take: 
$P(A = a) = \sum_v P(A = a, B = v)$.
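
The marginalization step can be sketched in a few lines. The joint table below is made up for illustration, and the helper `marginal` is a hypothetical name rather than anything defined in this book.

```python
# Hypothetical joint distribution P(A = a, B = b) over two binary variables.
joint = {
    (0, 0): 0.40, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.30,
}

def marginal(joint, axis):
    """Sum the joint over the other variable; axis=0 gives P(A), axis=1 gives P(B)."""
    result = {}
    for values, p in joint.items():
        result[values[axis]] = result.get(values[axis], 0.0) + p
    return result

p_a = marginal(joint, axis=0)
p_b = marginal(joint, axis=1)
print("P(A):", p_a)
print("P(B):", p_b)

# The joint probability never exceeds either marginal, as in (2.6.1).
joint_bounded = all(p <= p_a[a] and p <= p_b[b] for (a, b), p in joint.items())
print("joint <= marginals:", joint_bounded)
```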


The ratio $\frac{P(A = a, B = b)}{P(A = a)} \leq 1$ turns out to be extremely important. It is called the conditional
probability, and is denoted via the “|” symbol:
$$P(B = b \mid A = a) = \frac{P(A = a, B = b)}{P(A = a)}. \tag{2.6.2}$$


It tells us the new probability associated with the event B = b, once we condition on the 
fact A = a took place. We can think of this conditional probability as restricting attention 
only to the subset of the sample space associated with A = a and then renormalizing so that 
all probabilities sum to 1. Conditional probabilities are in fact just ordinary probabilities 
and thus respect all of the axioms, as long as we condition all terms on the same event and 
thus restrict attention to the same sample space. For instance, for disjoint events B and B′,
we have that P(B ∪ B′ | A = a) = P(B | A = a) + P(B′ | A = a).
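
The "restrict, then renormalize" view of conditioning can be sketched as follows. The joint table is invented for illustration and `conditional_on_a` is a hypothetical helper. The last lines check that the conditional probabilities sum to 1 and are additive over the disjoint events {B = 0} and {B = 1}.

```python
# Hypothetical joint P(A = a, B = b) with A in {0, 1} and B in {0, 1, 2}.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.30, (1, 1): 0.06, (1, 2): 0.24,
}

def conditional_on_a(joint, a):
    """Return P(B = b | A = a) by restricting to A = a and renormalizing."""
    restricted = {b: p for (a_val, b), p in joint.items() if a_val == a}
    p_a = sum(restricted.values())  # P(A = a), the normalizing constant
    return {b: p / p_a for b, p in restricted.items()}

p_b_given_a1 = conditional_on_a(joint, a=1)
print(p_b_given_a1)

# Conditional probabilities sum to 1 ...
total = sum(p_b_given_a1.values())

# ... and are additive over disjoint events such as {B = 0} and {B = 1}.
p_a1 = sum(p for (a_val, _), p in joint.items() if a_val == 1)
p_union = (joint[(1, 0)] + joint[(1, 1)]) / p_a1
p_sum = p_b_given_a1[0] + p_b_given_a1[1]
print(total, p_union, p_sum)
```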


Using the definition of conditional probabilities, we can derive the famous result called 
Bayes’ theorem. By construction, we have that P(A, B) = P(B | A)P(A) and P(A, B) = 
P(A | B)P(B). Combining both equations yields P(B | A)P(A) = P(A | B)P(B) and 
hence 

$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}. \tag{2.6.3}$$

This simple equation has profound implications because it allows us to reverse the order of 
conditioning. If we know how to estimate P(B | A), P(A), and P(B), then we can estimate 
P(A | B). We often find it easier to estimate one term directly but not the other and Bayes’ 
theorem can come to the rescue here. For instance, if we know the prevalence of symptoms 
for a given disease, and the overall prevalences of the disease and symptoms, respectively, 
we can determine how likely someone is to have the disease based on their symptoms. In 
some cases we might not have direct access to P(B), such as the prevalence of symptoms. 
In this case a simplified version of Bayes’ theorem comes in handy: 


$$P(A \mid B) \propto P(B \mid A) P(A). \tag{2.6.4}$$

Since we know that P(A | B) must be normalized to 1, i.e., $\sum_a P(A = a \mid B) = 1$, we can
use it to compute

$$P(A \mid B) = \frac{P(B \mid A) P(A)}{\sum_a P(B \mid A = a) P(A = a)}. \tag{2.6.5}$$

In Bayesian statistics, we think of an observer as possessing some (subjective) prior beliefs
about the plausibility of the available hypotheses encoded in the prior P(H), and a
likelihood function that says how likely one is to observe any value of the collected evidence
for each of the hypotheses in the class P(E | H). Bayes’ theorem is then interpreted
as telling us how to update the initial prior P(H) in light of the available evidence E to
produce posterior beliefs $P(H \mid E) = \frac{P(E \mid H) P(H)}{P(E)}$. Informally, this can be stated as “posterior
equals prior times likelihood, divided by the evidence”. Now, because the evidence
P(E) is the same for all hypotheses, we can get away with simply normalizing over the
hypotheses.
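
As a minimal sketch of "normalizing over the hypotheses", the function below multiplies likelihood by prior and divides by the sum. The function name `posterior` and the coin numbers are illustrative assumptions, not part of the text.

```python
def posterior(prior, likelihood):
    """Posterior over hypotheses: likelihood times prior, then normalized.

    prior[h] is P(H = h); likelihood[h] is P(E | H = h) for the observed evidence E.
    """
    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(E), obtained by summing over hypotheses
    return {h: u / evidence for h, u in unnormalized.items()}

# Hypothetical numbers: two hypotheses about a coin, and the evidence "heads".
prior = {"fair": 0.9, "biased": 0.1}
likelihood_heads = {"fair": 0.5, "biased": 0.9}

post = posterior(prior, likelihood_heads)
print(post)
```

Note that P(E) never had to be supplied separately; it falls out of the normalization.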


Note that $\sum_a P(A = a \mid B) = 1$ also allows us to marginalize over random variables.
That is, we can drop variables from a joint distribution such as P(A, B). After all, we have
that

$$\sum_a P(B \mid A = a) P(A = a) = \sum_a P(B, A = a) = P(B). \tag{2.6.6}$$


Independence is another fundamentally important concept that forms the backbone of many 
important ideas in statistics. In short, two variables are independent if conditioning on the 
value of A does not cause any change to the probability distribution associated with B and 
vice versa. More formally, independence, denoted A ⊥ B, requires that P(A | B) = P(A)
and, consequently, that P(A, B) = P(A | B)P(B) = P(A)P(B). Independence is often 
an appropriate assumption. For example, if the random variable A represents the outcome 
from tossing one fair coin and the random variable B represents the outcome from tossing 
another, then knowing whether A came up heads should not influence the probability of B 
coming up heads. 


Independence is especially useful when it holds among the successive draws of our data 
from some underlying distribution (allowing us to make strong statistical conclusions) or 
when it holds among various variables in our data, allowing us to work with simpler models 
that encode this independence structure. On the other hand, estimating the dependencies 
among random variables is often the very aim of learning. We care to estimate the probabil- 
ity of disease given symptoms specifically because we believe that diseases and symptoms 
are not independent. 


Note that because conditional probabilities are proper probabilities, the concepts of independence
and dependence also apply to them. Two random variables A and B are conditionally
independent given a third variable C if and only if P(A, B | C) = P(A | C)P(B | C).
Interestingly, two variables can be independent in general but become dependent when
conditioning on a third. This often occurs when the two random variables A and B correspond
to causes of some third variable C. For example, broken bones and lung cancer
might be independent in the general population but if we condition on being in the hospital
then we might find that broken bones are negatively correlated with lung cancer. That is
because the broken bone explains away why some person is in the hospital and thus lowers
the probability that they are hospitalized because of having lung cancer.


And conversely, two dependent random variables can become independent upon condition- 
ing on a third. This often happens when two otherwise unrelated events have a common 
cause. Shoe size and reading level are highly correlated among elementary school students, 
but this correlation disappears if we condition on age. 
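
The "explaining away" effect can be checked by exact enumeration. The sketch below assumes, purely for illustration, that broken bones and lung cancer each occur independently with probability 0.1 and that a person is hospitalized exactly when at least one of them is present; the helper names `prob` and `conditional` are hypothetical.

```python
from itertools import product

P_BONE, P_CANCER = 0.1, 0.1  # hypothetical, independent base rates

def prob(event):
    """Probability of an event given as a predicate over (bone, cancer, hospital)."""
    total = 0.0
    for bone, cancer in product([0, 1], repeat=2):
        p = (P_BONE if bone else 1 - P_BONE) * (P_CANCER if cancer else 1 - P_CANCER)
        hospital = int(bone or cancer)  # assumption: either condition means hospitalization
        if event(bone, cancer, hospital):
            total += p
    return total

def conditional(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    return prob(lambda *w: event(*w) and given(*w)) / prob(given)

cancer = lambda b, c, h: c == 1
p_cancer = prob(cancer)
p_cancer_hosp = conditional(cancer, lambda b, c, h: h == 1)
p_cancer_hosp_bone = conditional(cancer, lambda b, c, h: h == 1 and b == 1)

print(f"P(cancer)                   = {p_cancer:.3f}")
print(f"P(cancer | hospital)        = {p_cancer_hosp:.3f}")
print(f"P(cancer | hospital, bone)  = {p_cancer_hosp_bone:.3f}")
```

Among hospitalized patients, learning about the broken bone lowers the probability of lung cancer, even though the two are independent in the population as a whole.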


2.6.5 An Example 


Let’s put our skills to the test. Assume that a doctor administers an HIV test to a patient.
This test is fairly accurate: its only failure mode is a false positive, which occurs with 1%
probability, i.e., healthy patients test positive in 1% of cases. Moreover, it never
fails to detect HIV if the patient actually has it. We use $D_1 \in \{0, 1\}$ to indicate the diagnosis
(0 if negative and 1 if positive) and $H \in \{0, 1\}$ to denote the HIV status.


Conditional probability | H=1 | H=0
$P(D_1 = 1 \mid H)$ | 1 | 0.01
$P(D_1 = 0 \mid H)$ | 0 | 0.99


Note that the column sums are all 1 (but the row sums are not), since they are conditional
probabilities. Let’s compute the probability of the patient having HIV if the test comes
back positive, i.e., $P(H = 1 \mid D_1 = 1)$. Intuitively this is going to depend on how common
the disease is, since it affects the number of false alarms. Assume that the population is
fairly free of the disease, e.g., P(H = 1) = 0.0015. To apply Bayes’ theorem, we need to
apply marginalization to determine


$$\begin{aligned}
P(D_1 = 1) &= P(D_1 = 1, H = 0) + P(D_1 = 1, H = 1) \\
&= P(D_1 = 1 \mid H = 0) P(H = 0) + P(D_1 = 1 \mid H = 1) P(H = 1) \\
&= 0.011485.
\end{aligned} \tag{2.6.7}$$


This leads us to 


$$P(H = 1 \mid D_1 = 1) = \frac{P(D_1 = 1 \mid H = 1) P(H = 1)}{P(D_1 = 1)} = 0.1306. \tag{2.6.8}$$


In other words, there is only a 13.06% chance that the patient actually has HIV, despite the 
test being pretty accurate. As we can see, probability can be counterintuitive. What should a 
patient do upon receiving such terrifying news? Likely, the patient would ask the physician 
to administer another test to get clarity. The second test has different characteristics and it 
is not as good as the first one. 


Conditional probability | H=1 | H=0
$P(D_2 = 1 \mid H)$ | 0.98 | 0.03
$P(D_2 = 0 \mid H)$ | 0.02 | 0.97

Unfortunately, the second test comes back positive, too. Let’s calculate the requisite probabilities
to invoke Bayes’ theorem by assuming conditional independence:


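$$\begin{aligned}
P(D_1 = 1, D_2 = 1 \mid H = 0) &= P(D_1 = 1 \mid H = 0) P(D_2 = 1 \mid H = 0) = 0.0003, \\
P(D_1 = 1, D_2 = 1 \mid H = 1) &= P(D_1 = 1 \mid H = 1) P(D_2 = 1 \mid H = 1) = 0.98.
\end{aligned} \tag{2.6.9}$$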


Now we can apply marginalization to obtain the probability that both tests come back pos- 
itive: 

$$\begin{aligned}
P(D_1 = 1, D_2 = 1) &= P(D_1 = 1, D_2 = 1, H = 0) + P(D_1 = 1, D_2 = 1, H = 1) \\
&= P(D_1 = 1, D_2 = 1 \mid H = 0) P(H = 0) + P(D_1 = 1, D_2 = 1 \mid H = 1) P(H = 1) \\
&= 0.00176955.
\end{aligned} \tag{2.6.10}$$

Finally, the probability of the patient having HIV given that both tests are positive is 


$$P(H = 1 \mid D_1 = 1, D_2 = 1) = \frac{P(D_1 = 1, D_2 = 1 \mid H = 1) P(H = 1)}{P(D_1 = 1, D_2 = 1)} = 0.8307. \tag{2.6.11}$$


That is, the second test allowed us to gain much higher confidence that not all is well. De- 
spite the second test being considerably less accurate than the first one, it still significantly 
improved our estimate. The assumption of both tests being conditionally independent of 
each other was crucial for our ability to generate a more accurate estimate. Take the ex- 
treme case where we run the same test twice. In this situation we would expect the same 
outcome both times, hence no additional insight is gained from running the same test again. 
The astute reader might have noticed that the diagnosis behaved like a classifier hiding in 
plain sight where our ability to decide whether a patient is healthy increases as we obtain 
more features (test outcomes). 
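
The calculations in this example can be reproduced with a short script. The probabilities are the ones given in the text; the helper `posterior_positive` is a hypothetical name introduced only for this sketch.

```python
p_h = 0.0015                 # prior P(H = 1) from the text
p_d1 = {1: 1.0, 0: 0.01}     # P(D1 = 1 | H = h)
p_d2 = {1: 0.98, 0: 0.03}    # P(D2 = 1 | H = h)

def posterior_positive(likelihood, prior):
    """P(H = 1 | evidence), where likelihood[h] = P(evidence | H = h)."""
    numerator = likelihood[1] * prior
    evidence = numerator + likelihood[0] * (1 - prior)  # marginalize over H
    return numerator / evidence

one_test = posterior_positive(p_d1, p_h)

# Conditional independence given H lets us multiply the two likelihoods.
both = {h: p_d1[h] * p_d2[h] for h in (0, 1)}
two_tests = posterior_positive(both, p_h)

print(f"after one positive test:  {one_test:.4f}")
print(f"after two positive tests: {two_tests:.4f}")
```

The two printed values should agree with (2.6.8) and (2.6.11).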


2.6.6 Expectations 


Often, making decisions requires not just looking at the probabilities assigned to individ- 
ual events but composing them together into useful aggregates that can provide us with 
guidance. For example, when random variables take continuous scalar values, we often 
care about knowing what value to expect on average. This quantity is formally called an 
expectation. If we are making investments, the first quantity of interest might be the return 
we can expect, averaging over all the possible outcomes (and weighting by the appropriate
probabilities). For instance, say that with 50% probability, an investment might fail
altogether, with 40% probability it might provide a 2× return, and with 10% probability
it might provide a 10× return. To calculate the expected return, we sum over all returns,
multiplying each by the probability that it will occur. This yields the expectation
0.5 · 0 + 0.4 · 2 + 0.1 · 10 = 1.8. Hence the expected return is 1.8×.


In general, the expectation (or average) of the random variable X is defined as 


$$E[X] = E_{x \sim P}[x] = \sum_{x} x P(X = x). \tag{2.6.12}$$

Likewise, for densities we obtain $E[X] = \int x \, dp(x)$. Sometimes we are interested in the
expected value of some function of x. We can calculate these expectations as 


$$E_{x \sim P}[f(x)] = \sum_x f(x) P(x) \textrm{ and } E_{x \sim P}[f(x)] = \int f(x) p(x) \, dx \tag{2.6.13}$$


for discrete probabilities and densities, respectively. Returning to the investment example
from above, f might be the utility (happiness) associated with the return. Behavioral
economists have long noted that people associate greater disutility with losing one dollar than
the utility gained from earning one dollar relative to their baseline. Moreover, the value
of money tends to be sub-linear. Possessing 100k dollars versus zero dollars can make the 
difference between paying the rent, eating well, and enjoying quality healthcare versus suf- 
fering through homelessness. On the other hand, the gains due to possessing 200k versus 
100k are less dramatic. Reasoning like this motivates the cliché that “the utility of money 
is logarithmic”. 


If the utility associated with a total loss were −1, and the utilities associated with returns of
1, 2, and 10 were 1, 2, and 4, respectively, then the expected happiness of investing would
be 0.5 · (−1) + 0.4 · 2 + 0.1 · 4 = 0.7 (an expected loss of utility of 30%). If indeed this were
your utility function, you might be best off keeping the money in the bank.
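
Both numbers can be reproduced with a small helper that implements the discrete case of (2.6.13). The probabilities and utilities are taken from the text; the function name `expectation` is an illustrative choice.

```python
# Investment outcomes from the text: return multiple -> probability.
returns = {0: 0.5, 2: 0.4, 10: 0.1}
# Utilities from the text for a total loss and for returns of 2x and 10x.
utility = {0: -1, 2: 2, 10: 4}

def expectation(dist, f=lambda x: x):
    """E[f(X)] for a discrete distribution given as {value: probability}."""
    return sum(f(x) * p for x, p in dist.items())

expected_return = expectation(returns)
expected_utility = expectation(returns, lambda x: utility[x])
print(f"expected return:  {expected_return:.2f}")
print(f"expected utility: {expected_utility:.2f}")
```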


For financial decisions, we might also want to measure how risky an investment is. Here, we 
care not just about the expected value but how much the actual values tend to vary relative 
to this value. Note that we cannot just take the expectation of the difference between the 
actual and expected values. This is because the expectation of a difference is the difference 
of the expectations, i.e., E[X − E[X]] = E[X] − E[E[X]] = 0. However, we can look at
the expectation of any non-negative function of this difference. The variance of a random 
variable is calculated by looking at the expected value of the squared differences: 


$$\mathrm{Var}[X] = E\left[(X - E[X])^2\right] = E[X^2] - E[X]^2. \tag{2.6.14}$$


Here the equality follows by expanding (X − E[X])² = X² − 2XE[X] + E[X]² and taking
expectations for each term. The square root of the variance is another useful quantity called 
the standard deviation. While this and the variance convey the same information (either can 
be calculated from the other), the standard deviation has the nice property that it is expressed 
in the same units as the original quantity represented by the random variable. 


Lastly, the variance of a function of a random variable is defined analogously as 


$$\mathrm{Var}_{x \sim P}[f(x)] = E_{x \sim P}[f^2(x)] - E_{x \sim P}[f(x)]^2. \tag{2.6.15}$$


Returning to our investment example, we can now compute the variance of the investment.
It is given by 0.5 · 0 + 0.4 · 2² + 0.1 · 10² − 1.8² = 8.36. For all intents and purposes this
is a risky investment. Note that by mathematical convention mean and variance are often
referenced as μ and σ². This is particularly the case whenever we use them to parametrize a
Gaussian distribution.
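
As a quick check, the sketch below computes the variance of the investment both from the definition and from the shortcut in (2.6.14), and also reports the standard deviation. Only the probabilities come from the text; the variable names are illustrative.

```python
import math

returns = {0: 0.5, 2: 0.4, 10: 0.1}  # return multiple -> probability (from the text)

mean = sum(x * p for x, p in returns.items())

# Definition: expected squared deviation from the mean.
var_definition = sum((x - mean) ** 2 * p for x, p in returns.items())

# Shortcut: E[X^2] - E[X]^2.
var_shortcut = sum(x ** 2 * p for x, p in returns.items()) - mean ** 2

std = math.sqrt(var_definition)
print(f"mean={mean:.2f} variance={var_definition:.2f} std={std:.2f}")
```

The standard deviation is expressed in the same units as the return itself, which makes it easier to compare with the expected return of 1.8.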


In the same way as we introduced expectations and variance for scalar random variables, 
we can do so for vector-valued ones. Expectations are easy, since we can apply them elementwise.
For instance, $\boldsymbol{\mu} \stackrel{\textrm{def}}{=} E_{\mathbf{x} \sim P}[\mathbf{x}]$ has coordinates $\mu_i = E_{\mathbf{x} \sim P}[x_i]$. Covariances
are more complicated. We define them by taking expectations of the outer product of the
difference between random variables and their mean:


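$$\boldsymbol{\Sigma} \stackrel{\textrm{def}}{=} \mathrm{Cov}_{\mathbf{x} \sim P}[\mathbf{x}] = E_{\mathbf{x} \sim P}\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top\right]. \tag{2.6.16}$$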


This matrix $\boldsymbol{\Sigma}$ is referred to as the covariance matrix. An easy way to see its effect is to
consider some vector v of the same size as x. It follows that 


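$$\mathbf{v}^\top \boldsymbol{\Sigma} \mathbf{v} = E_{\mathbf{x} \sim P}\left[\mathbf{v}^\top (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top \mathbf{v}\right] = \mathrm{Var}_{x \sim P}[\mathbf{v}^\top \mathbf{x}]. \tag{2.6.17}$$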


As such, $\boldsymbol{\Sigma}$ allows us to compute the variance for any linear function of x by a simple
matrix multiplication. The off-diagonal elements tell us how correlated the coordinates
are: a value of 0 means no correlation, whereas a larger positive value means that they are
more strongly correlated.
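
The identity (2.6.17) can be verified numerically on a small example. The three-point distribution and the vector `v` below are invented for illustration, and plain Python lists are used so that the sketch has no dependencies.

```python
# Hypothetical discrete distribution over 2-d vectors: (x, probability).
points = [((1.0, 2.0), 0.25), ((3.0, 0.0), 0.25), ((2.0, 5.0), 0.5)]
dim = 2

# Elementwise expectation: mu_i = E[x_i].
mu = [sum(p * x[i] for x, p in points) for i in range(dim)]

# Covariance: expectation of the outer product of the centered vector.
sigma = [[sum(p * (x[i] - mu[i]) * (x[j] - mu[j]) for x, p in points)
          for j in range(dim)] for i in range(dim)]

v = [0.5, -1.0]  # arbitrary direction
quadratic_form = sum(v[i] * sigma[i][j] * v[j]
                     for i in range(dim) for j in range(dim))

# Variance of the scalar v^T x, computed directly from its definition.
proj = [(sum(v[i] * x[i] for i in range(dim)), p) for x, p in points]
proj_mean = sum(p * z for z, p in proj)
proj_var = sum(p * (z - proj_mean) ** 2 for z, p in proj)

print("mu =", mu)
print("Sigma =", sigma)
print("v^T Sigma v =", quadratic_form, " Var[v^T x] =", proj_var)
```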


2.6.7 Discussion 


In machine learning, there are many things to be uncertain about! We can be uncertain 
about the value of a label given an input. We can be uncertain about the estimated value of 
a parameter. We can even be uncertain about whether data arriving at deployment is
from the same distribution as the training data.


By aleatoric uncertainty, we mean uncertainty that is intrinsic to the problem, and due to 
genuine randomness unaccounted for by the observed variables. By epistemic uncertainty, 
we mean uncertainty over a model’s parameters, the sort of uncertainty that we can hope 
to reduce by collecting more data. We might have epistemic uncertainty concerning the 
probability that a coin turns up heads, but even once we know this probability, we are left 
with aleatoric uncertainty about the outcome of any future toss. No matter how long we 
watch someone tossing a fair coin, we will never be more or less than 50% certain that 
the next toss will come up heads. These terms come from mechanical modeling (see, e.g.,
Der Kiureghian and Ditlevsen (2009) for a review on this aspect of uncertainty quantification).
It is worth noting, however, that these terms constitute a slight abuse of language.
The term epistemic refers to anything concerning knowledge and thus, in the philosophical
sense, all uncertainty is epistemic.


We saw that sampling data from some unknown probability distribution can provide us with 
information that can be used to estimate the parameters of the data generating distribution. 
That said, the rate at which this is possible can be quite slow. In our coin tossing example 
(and many others) we can do no better than to design estimators that converge at a rate of 
$1/\sqrt{n}$, where n is the sample size (e.g., the number of tosses). This means that by going
from 10 to 1000 observations (usually a very achievable task) we see a tenfold reduction of 
uncertainty, whereas the next 1000 observations help comparatively little, offering only a 
1.41 times reduction. This is a persistent feature of machine learning: while there are often 
easy gains, it takes a very large amount of data, and often with it an enormous amount of 
computation, to make further gains. For an empirical review of this fact for large scale 
language models see Revels et al. (2016). 
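
The diminishing returns can be made concrete for the coin example. The sketch below uses the standard deviation of the fraction of heads in n independent tosses, which is sqrt(p(1 − p)/n); the helper name `std_error` is an illustrative choice.

```python
import math

def std_error(n, p=0.5):
    """Standard deviation of the fraction of heads in n independent tosses."""
    return math.sqrt(p * (1 - p) / n)

for n in (10, 1000, 2000):
    print(f"n={n:5d}  std of estimate={std_error(n):.4f}")

gain_first = std_error(10) / std_error(1000)    # going from 10 to 1000 tosses
gain_next = std_error(1000) / std_error(2000)   # adding another 1000 tosses
print(f"reduction 10 -> 1000:   {gain_first:.2f}x")
print(f"reduction 1000 -> 2000: {gain_next:.2f}x")
```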


We also sharpened our language and tools for statistical modeling. In the process,
we learned about conditional probabilities and about one of the most important equations
in statistics—Bayes’ theorem. It is an effective tool for decoupling information conveyed 
by data through a likelihood term P(B | A) that addresses how well observations B match 
a choice of parameters A, and a prior probability P(A) which governs how plausible a par- 
ticular choice of A was in the first place. In particular, we saw how this rule can be applied 
to assign probabilities to diagnoses, based on the efficacy of the test and the prevalence of 
the disease itself (i.e., our prior). 


Lastly, we introduced a first set of nontrivial questions about the effect of a specific proba- 
bility distribution, namely expectations and variances. While there are many more than just 
linear and quadratic expectations for a probability distribution, these two already provide 
a good deal of knowledge about the possible behavior of the distribution. For instance, 
Chebyshev’s inequality states that $P(|X - \mu| \geq k \sigma) \leq 1/k^2$, where μ is the expectation,
σ² is the variance of the distribution, and k > 1 is a confidence parameter of our
choosing. It tells us that draws from a distribution lie with at least 50% probability within
a $[-\sqrt{2} \sigma, \sqrt{2} \sigma]$ interval centered on the expectation.
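
As a sanity check, Chebyshev's inequality can be compared against exact tail probabilities for the investment example from Section 2.6.6. The distribution is the one from the text; the helper `tail_probability` and the chosen values of k are illustrative.

```python
import math

returns = {0: 0.5, 2: 0.4, 10: 0.1}  # investment example: value -> probability

mu = sum(x * p for x, p in returns.items())
sigma = math.sqrt(sum((x - mu) ** 2 * p for x, p in returns.items()))

def tail_probability(k):
    """Exact P(|X - mu| >= k * sigma) for the discrete distribution above."""
    return sum(p for x, p in returns.items() if abs(x - mu) >= k * sigma)

ks = (1.0, math.sqrt(2), 2.0, 3.0)
for k in ks:
    bound = 1 / k ** 2
    print(f"k={k:.2f}: tail={tail_probability(k):.2f} <= bound={bound:.2f}")
```

The bound holds for every k but is typically loose, since it makes no assumption about the distribution beyond its mean and variance.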


2.6.8 Exercises 


1. Give an example where observing more data can reduce the amount of uncertainty about 
the outcome to an arbitrarily low level. 


2. Give an example where observing more data will only reduce the amount of uncertainty 
up to a point and then no further. Explain why this is the case and where you expect this 
point to occur. 


3. We empirically demonstrated convergence to the mean for the toss of a coin. Calculate 
the variance of the estimate of the probability that we see a head after drawing n samples. 


1. How does the variance scale with the number of observations? 
2. Use Chebyshev’s inequality to bound the deviation from the expectation. 
3. How does it relate to the central limit theorem? 


4. Assume that we draw m samples $x_i$ from a probability distribution with zero mean and
unit variance. Compute the averages $z_m \stackrel{\textrm{def}}{=} m^{-1} \sum_{i=1}^m x_i$. Can we apply Chebyshev’s
inequality for every $z_m$ independently? Why not?


5. Given two events with probability P(A) and P(B), compute upper and lower bounds
on P(A ∪ B) and P(A ∩ B). Hint: graph the situation using a Venn diagram.


6. Assume that we have a sequence of random variables, say A, B, and C, where B only depends
on A, and C only depends on B. Can you simplify the joint probability P(A, B, C)?
Hint: this is a Markov chain.


7. In Section 2.6.5, assume that the outcomes of the two tests are not independent. In
particular assume that either test on its own has a false positive rate of 10% and a false
negative rate of 1%. That is, assume that P(D = 1 | H = 0) = 0.1 and that P(D =
0 | H = 1) = 0.01. Moreover, assume that for H = 1 (infected) the test outcomes are
conditionally independent, i.e., that $P(D_1, D_2 \mid H = 1) = P(D_1 \mid H = 1) P(D_2 \mid H = 1)$
but that for healthy patients the outcomes are coupled via $P(D_1 = D_2 = 1 \mid H = 0) = 0.02$.


1. Work out the joint probability table for $D_1$ and $D_2$, given H = 0, based on the information
you have so far.


2. Derive the probability that the patient is diseased (H = 1) after one test returns 
positive. You can assume the same baseline probability P(H = 1) = 0.0015 as 
before. 


3. Derive the probability that the patient is diseased (H = 1) after both tests return 
positive. 


8. Assume that you are an asset manager for an investment bank and you have a choice of 
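stocks $s_i$ to invest in. Your portfolio needs to add up to 1 with weights $\alpha_i$ for each stock.
The stocks have an average return $\boldsymbol{\mu} = E_{\mathbf{s} \sim P}[\mathbf{s}]$ and covariance $\boldsymbol{\Sigma} = \mathrm{Cov}_{\mathbf{s} \sim P}[\mathbf{s}]$.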


1. Compute the expected return for a given portfolio $\boldsymbol{\alpha}$.


2. If you wanted to maximize the return of the portfolio, how should you choose your 
investment? 


3. Compute the variance of the portfolio. 


4. Formulate an optimization problem of maximizing the return while keeping the variance
constrained to an upper bound. This is the Nobel Prize-winning Markowitz portfolio
(Mangram, 2013). To solve it you will need a quadratic programming solver,
something way beyond the scope of this book.


2.7 Documentation


While we cannot possibly introduce every single PyTorch function and class (and the information
might become outdated quickly), the API documentation and additional tutorials
and examples provide such documentation. This section provides some guidance for
how to explore the PyTorch API.


import torch 


2.7.1 Functions and Classes in a Module 


To know which functions and classes can be called in a module, we invoke the dir func- 
tion. For instance, we can query all properties in the module for generating random num- 
bers: 
print(dir(torch.distributions))


['AbsTransform', 'AffineTransform', 'Bernoulli', 'Beta', 'Binomial',
 'CatTransform', 'Categorical', 'Cauchy', 'Chi2', 'ComposeTransform',
 'ContinuousBernoulli', 'CorrCholeskyTransform',
 'CumulativeDistributionTransform', 'Dirichlet', 'Distribution',
 'ExpTransform', 'Exponential', 'ExponentialFamily', 'FisherSnedecor',
 'Gamma', 'Geometric', 'Gumbel', 'HalfCauchy', 'HalfNormal', 'Independent',
 'IndependentTransform', 'Kumaraswamy', 'LKJCholesky', 'Laplace',
 'LogNormal', 'LogisticNormal', 'LowRankMultivariateNormal',
 'LowerCholeskyTransform', 'MixtureSameFamily', 'Multinomial',
 'MultivariateNormal', 'NegativeBinomial', 'Normal', 'OneHotCategorical',
 'OneHotCategoricalStraightThrough', 'Pareto', 'Poisson',
 'PositiveDefiniteTransform', 'PowerTransform', 'RelaxedBernoulli',
 'RelaxedOneHotCategorical', 'ReshapeTransform', 'SigmoidTransform',
 'SoftmaxTransform', 'SoftplusTransform', 'StackTransform',
 'StickBreakingTransform', 'StudentT', 'TanhTransform', 'Transform',
 'TransformedDistribution', 'Uniform', 'VonMises', 'Weibull', 'Wishart',
 '__all__', '__builtins__', '__cached__', '__doc__', '__file__',
 '__loader__', '__name__', '__package__', '__path__', '__spec__',
 'bernoulli', 'beta', 'biject_to', 'binomial', 'categorical', 'cauchy',
 'chi2', 'constraint_registry', 'constraints', 'continuous_bernoulli',
 'dirichlet', 'distribution', 'exp_family', 'exponential', 'fishersnedecor',
 'gamma', 'geometric', 'gumbel', 'half_cauchy', 'half_normal',
 'identity_transform', 'independent', 'kl', 'kl_divergence', 'kumaraswamy',
 'laplace', 'lkj_cholesky', 'log_normal', 'logistic_normal',
 'lowrank_multivariate_normal', 'mixture_same_family', 'multinomial',
 'multivariate_normal', 'negative_binomial', 'normal',
 'one_hot_categorical', 'pareto', 'poisson', 'register_kl',
 'relaxed_bernoulli', 'relaxed_categorical', 'studentT', 'transform_to',
 'transformed_distribution', 'transforms', 'uniform', 'utils', 'von_mises',
 'weibull', 'wishart']


Generally, we can ignore functions that start and end with __ (special objects in Python) or 
functions that start with a single _ (usually internal functions). Based on the remaining function
or attribute names, we might hazard a guess that this module offers various methods for
generating random numbers, including sampling from the uniform distribution (uniform), 
normal distribution (normal), and multinomial distribution (multinomial). 
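
The filtering rule just described is easy to automate. The helper `public_names` below is a hypothetical name, and the sketch applies it to the standard library's `math` module only so that it runs without PyTorch installed.

```python
import math  # stand-in module so that the sketch runs without PyTorch

def public_names(module):
    """Names in dir(module) that do not start with an underscore.

    This drops both special objects such as __doc__ and internal names
    that start with a single underscore.
    """
    return [name for name in dir(module) if not name.startswith("_")]

names = public_names(math)
print(len(names), names[:5])

# With PyTorch installed, public_names(torch.distributions) applies the same filter.
```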


2.7.2 Specific Functions and Classes 


For specific instructions on how to use a given function or class, we can invoke the help 
function. As an example, let’s explore the usage instructions for tensors’ ones function. 


help(torch.ones)


Help on built-in function ones in module torch:

ones(...)
    ones(*size, *, out=None, dtype=None, layout=torch.strided, device=None, requires_grad=False) -> Tensor

    Returns a tensor filled with the scalar value 1, with the shape defined
    by the variable argument size.

    Args:
        size (int...): a sequence of integers defining the shape of the output tensor.
            Can be a variable number of arguments or a collection like a list or tuple.

    Keyword arguments:
        out (Tensor, optional): the output tensor.
        dtype (torch.dtype, optional): the desired data type of returned tensor.
            Default: if None, uses a global default (see torch.set_default_tensor_type()).
        layout (torch.layout, optional): the desired layout of returned Tensor.
            Default: torch.strided.
        device (torch.device, optional): the desired device of returned tensor.
            Default: if None, uses the current device for the default tensor type
            (see torch.set_default_tensor_type()). device will be the CPU
            for CPU tensor types and the current CUDA device for CUDA tensor types.
        requires_grad (bool, optional): If autograd should record operations on the
            returned tensor. Default: False.

    Example::

        >>> torch.ones(2, 3)
        tensor([[ 1.,  1.,  1.],
                [ 1.,  1.,  1.]])

        >>> torch.ones(5)
        tensor([ 1.,  1.,  1.,  1.,  1.])


From the documentation, we can see that the ones function creates a new tensor with the
specified shape and sets all the elements to the value of 1. Whenever possible, you should
run a quick test to confirm your interpretation:


torch.ones(4)


tensor([1., 1., 1., 1.]) 
In the Jupyter notebook, we can use ? to display the document in another window. For
example, list? will create content that is almost identical to help(list), displaying it 
in a new browser window. In addition, if we use two question marks, such as list??, the 
Python code implementing the function will also be displayed. 
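
Outside a notebook, the standard library's `inspect` module offers a rough equivalent of ? and ??. This is only a sketch: `json.dumps` is chosen here as an arbitrary example of a function written in Python, since built-ins such as list are implemented in C and have no Python source to display.

```python
import inspect
import json  # any pure-Python module works; json serves only as an example

# Roughly the text shown by help(list) or list? in a notebook.
doc = inspect.getdoc(list)

# Roughly what json.dumps?? would display: the Python source of the function.
source = inspect.getsource(json.dumps)

print(doc.splitlines()[0])
print(source.splitlines()[0])
```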


The official documentation provides plenty of descriptions and examples that are beyond
the scope of this book. We emphasize important use cases that will get you started quickly with
practical problems, rather than completeness of coverage. We also encourage you to study the
source code of the libraries to see examples of high-quality implementations of production 
code. By doing this you will become a better engineer in addition to becoming a better 
scientist. 




3 Linear Neural Networks for Regression


Before we worry about making our neural networks deep, it will be helpful to implement 
some shallow ones, for which the inputs connect directly to the outputs. This will prove im- 
portant for a few reasons. First, rather than getting distracted by complicated architectures, 
we can focus on the basics of neural network training, including parametrizing the output 
layer, handling data, specifying a loss function, and training the model. Second, this class 
of shallow networks happens to comprise the set of linear models, which subsumes many 
classical methods of statistical prediction, including linear and softmax regression. Un- 
derstanding these classical tools is pivotal because they are widely used in many contexts 
and we will often need to use them as baselines when justifying the use of fancier archi- 
tectures. This chapter will focus narrowly on linear regression and the next one will extend 
our modeling repertoire by developing linear neural networks for classification. 


3.1 Linear Regression 


Regression problems pop up whenever we want to predict a numerical value. Common ex- 
amples include predicting prices (of homes, stocks, etc.), predicting the length of stay (for 
patients in the hospital), forecasting demand (for retail sales), among numerous others. Not 
every prediction problem is one of classical regression. Later on, we will introduce classifi- 
cation problems, where the goal is to predict membership among a set of categories. 


As a running example, suppose that we wish to estimate the prices of houses (in dollars)
based on their area (in square feet) and age (in years). To develop a model for predicting 
house prices, we need to get our hands on data, including the sales price, area, and age for 
each home. In the terminology of machine learning, the dataset is called a training dataset 
or training set, and each row (containing the data corresponding to one sale) is called an 
example (or data point, instance, sample). The thing we are trying to predict (price) is 
called a label (or target). The variables (age and area) upon which the predictions are 
based are called features (or covariates). 


%matplotlib inline
import math
import time
import numpy as np
import torch
from d2l import torch as d2l


3.1.1 Basics 


Linear regression is both the simplest and most popular among the standard tools for tack- 
ling regression problems. Dating back to the dawn of the 19th century (Gauss, 1809, Leg- 
endre, 1805), linear regression flows from a few simple assumptions. First, we assume that 
the relationship between features x and target y is approximately linear, i.e., that the con- 
ditional mean E[Y | X = x] can be expressed as a weighted sum of the features x. This 
setup allows that the target value may still deviate from its expected value on account of 
observation noise. Next, we can impose the assumption that any such noise is well behaved, 
following a Gaussian distribution. Typically, we will use n to denote the number of exam- 
ples in our dataset. We use superscripts to enumerate samples and targets, and subscripts 
to index coordinates. More concretely, $\mathbf{x}^{(i)}$ denotes the $i^{\mathrm{th}}$ sample and $x_j^{(i)}$ denotes its $j^{\mathrm{th}}$
coordinate.


Model 


At the heart of every solution is a model that describes how features can be transformed 
into an estimate of the target. The assumption of linearity means that the expected value of 
the target (price) can be expressed as a weighted sum of the features (area and age): 


$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b. \tag{3.1.1}$$


Here $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called weights, and $b$ is called a bias (or offset or intercept). The
weights determine the influence of each feature on our prediction. The bias determines the 
value of the estimate when all features are zero. Even though we will never see any newly- 
built homes with precisely zero area, we still need the bias because it allows us to express 
all linear functions of our features (rather than restricting us to lines that pass through the 
origin). Strictly speaking, (3.1.1) is an affine transformation of input features, which is 
characterized by a linear transformation of features via a weighted sum, combined with a 
translation via the added bias. Given a dataset, our goal is to choose the weights w and 
the bias b that, on average, make our model’s predictions fit the true prices observed in the 
data as closely as possible. 


In disciplines where it is common to focus on datasets with just a few features, explicitly 
expressing models long-form, as in (3.1.1), is common. In machine learning, we usually 
work with high-dimensional datasets, where it is more convenient to employ compact lin- 
ear algebra notation. When our inputs consist of d features, we can assign each an index 
(between 1 and $d$) and express our prediction $\hat{y}$ (in general the “hat” symbol denotes an
estimate) as


$$\hat{y} = w_1 x_1 + \cdots + w_d x_d + b. \tag{3.1.2}$$


Collecting all features into a vector $\mathbf{x} \in \mathbb{R}^d$ and all weights into a vector $\mathbf{w} \in \mathbb{R}^d$, we can
express our model compactly via the dot product between $\mathbf{w}$ and $\mathbf{x}$:


$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b. \tag{3.1.3}$$


In (3.1.3), the vector x corresponds to the features of a single example. We will often 
find it convenient to refer to features of our entire dataset of n examples via the design 
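matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$. Here, $\mathbf{X}$ contains one row for every example and one column for every
feature. For a collection of features $\mathbf{X}$, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed via the
matrix-vector product:


$$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} + b, \tag{3.1.4}$$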


where broadcasting (Section 2.1.4) is applied during the summation. Given features of a 
training dataset X and corresponding (known) labels y, the goal of linear regression is to 
find the weight vector w and the bias term b such that, given features of a new data example 
sampled from the same distribution as X, the new example’s label will (in expectation) be 
predicted with the smallest error. 
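To make (3.1.4) concrete, here is a minimal sketch that computes predictions for a small design matrix. The feature values, weights, and bias below are made-up illustrative numbers, not fitted parameters; the point is only to show the matrix-vector product and how the scalar bias is broadcast over all examples.

```python
import torch

# Hypothetical design matrix: 3 houses, 2 features (area in sq. ft., age in years)
X = torch.tensor([[1000.0, 10.0],
                  [1500.0, 5.0],
                  [800.0, 30.0]])
# Illustrative (not fitted) weights and bias
w = torch.tensor([200.0, -1000.0])
b = torch.tensor(50000.0)

# Matrix-vector product; the scalar b is broadcast across all n predictions
y_hat = X @ w + b

# The same predictions computed one example at a time via (3.1.3)
y_hat_loop = torch.stack([torch.dot(w, x) + b for x in X])
```

Both `y_hat` and `y_hat_loop` hold one prediction per row of `X`; the vectorized form simply avoids the explicit loop.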


Even if we believe that the best model for predicting y given x is linear, we would not 
expect to find a real-world dataset of $n$ examples where $y^{(i)}$ exactly equals $\mathbf{w}^\top \mathbf{x}^{(i)} + b$
for all $1 \leq i \leq n$. For example, whatever instruments we use to observe the features $\mathbf{X}$
and labels y, there might be a small amount of measurement error. Thus, even when we 
are confident that the underlying relationship is linear, we will incorporate a noise term to 
account for such errors. 


Before we can go about searching for the best parameters (or model parameters) w and b, 
we will need two more things: (i) a measure of the quality of some given model; and (ii) a 
procedure for updating the model to improve its quality. 


Loss Function 


Naturally, fitting our model to the data requires that we agree on some measure of fitness 
(or, equivalently, of unfitness). Loss functions quantify the distance between the real and 
predicted values of the target. The loss will usually be a nonnegative number where smaller 
values are better and perfect predictions incur a loss of 0. For regression problems, the most 
common loss function is the squared error. When our prediction for an example $i$ is $\hat{y}^{(i)}$
and the corresponding true label is $y^{(i)}$, the squared error is given by:


$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2. \tag{3.1.5}$$


The constant $\frac{1}{2}$ makes no real difference but proves to be notationally convenient, since it
cancels out when we take the derivative of the loss. Because the training dataset is given 
to us, and thus is out of our control, the empirical error is only a function of the model 
parameters. In Fig. 3.1.1, we visualize the fit of a linear regression model in a problem 
with one-dimensional inputs. 


Fig. 3.1.1: Fitting a linear regression model to one-dimensional data.


Note that large differences between estimates $\hat{y}^{(i)}$ and targets $y^{(i)}$ lead to even larger
contributions to the loss, due to its quadratic form (this quadraticity can be a double-edged sword;
while it encourages the model to avoid large errors it can also lead to excessive sensitivity
to anomalous data). To measure the quality of a model on the entire dataset of $n$ examples,
we simply average (or equivalently, sum) the losses on the training set:


$$L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2. \tag{3.1.6}$$


When training the model, we seek parameters ($\mathbf{w}^*, b^*$) that minimize the total loss across
all training examples:


$$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\ L(\mathbf{w}, b). \tag{3.1.7}$$
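The per-example squared error and its average over a dataset can be sketched in a few lines. The function name `squared_loss` and the numbers are our own illustrative choices.

```python
import torch

def squared_loss(y_hat, y):
    """Per-example squared error, including the factor 1/2."""
    return 0.5 * (y_hat - y) ** 2

# Made-up predictions and labels for n = 4 examples
y_hat = torch.tensor([2.5, 0.0, 2.0, 8.0])
y = torch.tensor([3.0, -0.5, 2.0, 7.0])

per_example = squared_loss(y_hat, y)   # one loss value per example
L = per_example.mean()                 # average loss over the dataset
```

The last example has twice the error of the first (1.0 versus 0.5) but four times the loss, which illustrates how the quadratic form amplifies large deviations.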


Analytic Solution 


Unlike most of the models that we will cover, linear regression presents us with a surpris- 
ingly easy optimization problem. In particular, we can find the optimal parameters (as 
assessed on the training data) analytically by applying a simple formula as follows. First, 
we can subsume the bias $b$ into the parameter $\mathbf{w}$ by appending a column to the design matrix
consisting of all 1s. Then our prediction problem is to minimize $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$. As long
as the design matrix X has full rank (no feature is linearly dependent on the others), then 
there will be just one critical point on the loss surface and it corresponds to the minimum 
of the loss over the entire domain. Taking the derivative of the loss with respect to w and 
setting it equal to zero yields: 


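$$\partial_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = 2 \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{y}) = 0 \text{ and hence } \mathbf{X}^\top \mathbf{y} = \mathbf{X}^\top \mathbf{X} \mathbf{w}. \tag{3.1.8}$$


Solving for $\mathbf{w}$ provides us with the optimal solution for the optimization problem. Note
that this solution


$$\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \tag{3.1.9}$$


will only be unique when the matrix $\mathbf{X}^\top \mathbf{X}$ is invertible, i.e., when the columns of the design
matrix are linearly independent (Golub and Van Loan, 1996).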
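As a rough illustration of the analytic solution, the sketch below generates synthetic data from a known linear model, appends a column of ones to absorb the bias, and solves the normal equations. The data-generating values (`true_w`, `true_b`, the noise level) are assumptions made for this example. We call `torch.linalg.solve` on the normal equations rather than forming the inverse explicitly, which is a common numerical preference and not something the derivation requires.

```python
import torch

torch.manual_seed(0)

# Synthetic data from a known linear model (all values are illustrative)
n, d = 200, 2
true_w = torch.tensor([2.0, -3.4])
true_b = 4.2
X = torch.randn(n, d)
y = X @ true_w + true_b + 0.01 * torch.randn(n)

# Append a column of ones so that the bias becomes the last weight
X_aug = torch.cat([X, torch.ones(n, 1)], dim=1)

# Solve the normal equations X^T X w = X^T y without forming an explicit inverse
w_star = torch.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

w_hat, b_hat = w_star[:d], w_star[d]
```

With little noise and a full-rank design matrix, `w_hat` and `b_hat` should land close to the values used to generate the data.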


While simple problems like linear regression may admit analytic solutions, you should 
not get used to such good fortune. Although analytic solutions allow for nice mathematical 
analysis, the requirement of an analytic solution is so restrictive that it would exclude almost 
all exciting aspects of deep learning.


Minibatch Stochastic Gradient Descent


Fortunately, even in cases where we cannot solve the models analytically, we can still of- 
ten train models effectively in practice. Moreover, for many tasks, those hard-to-optimize 
models turn out to be so much better that figuring out how to train them ends up being well 
worth the trouble. 


The key technique for optimizing nearly every deep learning model, and which we will 
call upon throughout this book, consists of iteratively reducing the error by updating the 
parameters in the direction that incrementally lowers the loss function. This algorithm is 
called gradient descent. 


The most naive application of gradient descent consists of taking the derivative of the loss 
function, which is an average of the losses computed on every single example in the dataset. 
In practice, this can be extremely slow: we must pass over the entire dataset before making 
a single update, even if the update steps might be very powerful (Liu and Nocedal, 1989). 
Even worse, if there is a lot of redundancy in the training data, the benefit of a full update 
is limited. 


The other extreme is to consider only a single example at a time and to take update steps 
based on one observation at a time. The resulting algorithm, stochastic gradient descent
(SGD), can be an effective strategy (Bottou, 2010), even for large datasets. Unfortunately,
SGD has drawbacks, both computational and statistical. One problem arises from the fact
that processors are a lot faster at multiplying and adding numbers than they are at moving data
from main memory to processor cache. It is up to an order of magnitude more efficient
to perform a matrix-vector multiplication than a corresponding number of vector-vector
operations. This means that it can take a lot longer to process one sample at a time compared 
to a full batch. A second problem is that some of the layers, such as batch normalization 
(to be described in Section 8.5), only work well when we have access to more than one 
observation at a time. 


The solution to both problems is to pick an intermediate strategy: rather than taking a full 
batch or only a single sample at a time, we take a minibatch of observations (Li et al., 2014). 
The specific choice of the size of the said minibatch depends on many factors, such as the 
amount of memory, the number of accelerators, the choice of layers, and the total dataset 
size. Despite all that, a number between 32 and 256, preferably a multiple of a large power 
of 2, is a good start. This leads us to minibatch stochastic gradient descent. 


In its most basic form, in each iteration $t$, we first randomly sample a minibatch $\mathcal{B}_t$ consisting
of a fixed number $|\mathcal{B}|$ of training examples. We then compute the derivative (gradient)
of the average loss on the minibatch with respect to the model parameters. Finally, we multiply
the gradient by a predetermined small positive value $\eta$, called the learning rate, and
subtract the resulting term from the current parameter values. We can express the update
as follows:


$$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_{(\mathbf{w}, b)} l^{(i)}(\mathbf{w}, b). \tag{3.1.10}$$


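In summary, minibatch SGD proceeds as follows: (i) initialize the values of the model
parameters, typically at random; (ii) iteratively sample random minibatches from the data,
updating the parameters in the direction of the negative gradient. For quadratic losses and
affine transformations, this has a closed-form expansion:


$$\begin{aligned}
\mathbf{w} & \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right), \\
b & \leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right).
\end{aligned} \tag{3.1.11}$$


Since we pick a minibatch $\mathcal{B}$ we need to normalize by its size $|\mathcal{B}|$. Frequently minibatch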
size and learning rate are user-defined. Such tunable parameters that are not updated in the 
training loop are called hyperparameters. They can be tuned automatically by a number 
of techniques, such as Bayesian optimization (Frazier, 2018). In the end, the quality of the 
solution is typically assessed on a separate validation dataset (or validation set). 
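The two-step procedure above can be sketched directly from the closed-form gradients of the squared loss. Everything concrete here is an assumption chosen for a toy problem: the synthetic data, the learning rate, the batch size, the number of epochs, and the zero initialization. The sketch also shuffles the data once per epoch and walks through it in chunks, a common practical variant of randomly sampling each minibatch.

```python
import torch

torch.manual_seed(0)

# Synthetic data (illustrative values)
n, d = 1000, 2
true_w = torch.tensor([2.0, -3.4])
true_b = 4.2
X = torch.randn(n, d)
y = X @ true_w + true_b + 0.01 * torch.randn(n)

# Hyperparameters, chosen by hand for this toy problem
lr, batch_size, num_epochs = 0.03, 32, 10

# (i) Initialize the parameters
w = torch.zeros(d)
b = torch.zeros(1)

for epoch in range(num_epochs):
    # (ii) Visit the data in random minibatches
    perm = torch.randperm(n)
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        err = Xb @ w + b - yb            # residuals on the minibatch
        grad_w = Xb.T @ err / len(idx)   # average gradient with respect to w
        grad_b = err.mean()              # average gradient with respect to b
        w -= lr * grad_w
        b -= lr * grad_b
```

Here `lr`, `batch_size`, and `num_epochs` play the role of hyperparameters: they are set before training and are never updated by the loop itself.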


After training for some predetermined number of iterations (or until some other stopping 
criterion is met), we record the estimated model parameters, denoted $\hat{\mathbf{w}}, \hat{b}$. Note that even if
our function is truly linear and noiseless, these parameters will not be the exact minimizers
of the loss, nor even deterministic. Although the algorithm converges slowly towards the
minimizers, it typically will not find them exactly in a finite number of steps. Moreover,
the minibatches $\mathcal{B}$ used for updating the parameters are chosen at random. This breaks
determinism.


Linear regression happens to be a learning problem with a global minimum (whenever $\mathbf{X}$
is full rank, or equivalently, whenever $\mathbf{X}^\top \mathbf{X}$ is invertible). However, the loss surfaces for
deep networks contain many saddle points and minima. Fortunately, we typically do not 
care about finding an exact set of parameters but merely any set of parameters that leads 
to accurate predictions (and thus low loss). In practice, deep learning practitioners seldom 
struggle to find parameters that minimize the loss on training sets (Frankle and Carbin, 
2018, Izmailov et al., 2018). The more formidable task is to find parameters that lead 
to accurate predictions on previously unseen data, a challenge called generalization. We 
return to these topics throughout the book. 


Predictions 


Given the model $\hat{\mathbf{w}}^\top \mathbf{x} + \hat{b}$, we can now make predictions for a new example, e.g., predicting
the sales price of a previously unseen house given its area $x_1$ and age $x_2$. Deep
learning practitioners have taken to calling the prediction phase inference but this is a bit of 
a misnomer—inference refers broadly to any conclusion reached on the basis of evidence, 
including both the values of the parameters and the likely label for an unseen instance. If 
anything, in the statistics literature inference more often denotes parameter inference and 
this overloading of terminology creates unnecessary confusion when deep learning prac- 
titioners talk to statisticians. In the following we will stick to prediction whenever possi- 
ble. 


3.1.2 Vectorization for Speed


When training our models, we typically want to process whole minibatches of examples simultaneously.
Doing this efficiently requires that we vectorize the calculations and leverage
fast linear algebra libraries rather than writing costly for-loops in Python. 


To see why this matters so much, let’s consider two methods for adding vectors. To start, we 
instantiate two 10,000-dimensional vectors containing all 1s. In the first method, we loop 
over the vectors with a Python for-loop. In the second, we rely on a single call to +. 


n = 10000
a = torch.ones(n)
b = torch.ones(n)


Now we can benchmark the workloads. First, we add them, one coordinate at a time, using 
a for-loop. 


c = torch.zeros(n)
t = time.time()
for i in range(n):
    c[i] = a[i] + b[i]
f'{time.time() - t:.5f} sec'


'0.17802 sec'


Alternatively, we rely on the overloaded + operator to compute the elementwise sum.


t = time.time()
d = a + b
f'{time.time() - t:.5f} sec'


'0.00036 sec'


The second method is dramatically faster than the first. Vectorizing code often yields order- 
of-magnitude speedups. Moreover, we push more of the mathematics to the library so we 
do not have to write as many calculations ourselves, reducing the potential for errors and 
increasing portability of the code. 
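The same idea carries over from adding vectors to computing predictions for a whole minibatch. The sketch below compares a Python loop of dot products against a single matrix-vector product, using `timeit` to average over a few repetitions. The sizes and the helper names `predict_loop` and `predict_vectorized` are illustrative assumptions, and the measured times depend on your hardware.

```python
import timeit
import torch

n, d = 2000, 50
X = torch.ones(n, d)
w = torch.ones(d)

def predict_loop():
    # One dot product per example, driven by a Python for-loop
    return torch.stack([torch.dot(x, w) for x in X])

def predict_vectorized():
    # A single matrix-vector product handles all examples at once
    return X @ w

# timeit repeats each call to smooth out timer noise
t_loop = timeit.timeit(predict_loop, number=3) / 3
t_vec = timeit.timeit(predict_vectorized, number=3) / 3
print(f'loop: {t_loop:.5f} sec, vectorized: {t_vec:.5f} sec')
```

Both functions return the same values; only the amount of work done in the Python interpreter differs.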


3.1.3 The Normal Distribution and Squared Loss 


So far we have given a fairly functional motivation of the squared loss objective: the optimal 
parameters return the conditional expectation E[Y | X] whenever the underlying pattern 
is truly linear, and the loss assigns large penalties for outliers. We can also provide a more 
formal motivation for the squared loss objective by making probabilistic assumptions about 
the distribution of noise. 


Linear regression was invented at the turn of the 19th century. While it has long been 
debated whether Gauss or Legendre first thought up the idea, it was Gauss who also discovered
the normal distribution (also called the Gaussian). It turns out that the normal
distribution and linear regression with squared loss share a deeper connection than common
parentage.


To begin, recall that a normal distribution with mean $\mu$ and variance $\sigma^2$ (standard deviation
$\sigma$) is given as


$$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right). \tag{3.1.12}$$


Below we define a function to compute the normal distribution. 


def normal(x, mu, sigma):
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 * (x - mu)**2 / sigma**2)


We can now visualize the normal distributions. 


# Use NumPy again for visualization
x = np.arange(-7, 7, 0.01)

# Mean and standard deviation pairs
params = [(0, 1), (0, 2), (3, 1)]
d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in params], xlabel='x',
         ylabel='p(x)', figsize=(4.5, 2.5),
         legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])


[Figure: the densities p(x) of the three normal distributions, with legend entries “mean 0, std 1”, “mean 0, std 2”, and “mean 3, std 1”.]


Note that changing the mean corresponds to a shift along the x-axis, and increasing the 
variance spreads the distribution out, lowering its peak. 


One way to motivate linear regression with squared loss is to assume that observations arise 
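from noisy measurements, where the noise $\epsilon$ follows the normal distribution $\mathcal{N}(0, \sigma^2)$:


$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, \sigma^2). \tag{3.1.13}$$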


Thus, we can now write out the likelihood of seeing a particular y for a given x via 


$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (y - \mathbf{w}^\top \mathbf{x} - b)^2\right). \tag{3.1.14}$$


As such, the likelihood factorizes. According to the principle of maximum likelihood, the
best values of parameters $\mathbf{w}$ and $b$ are those that maximize the likelihood of the entire
dataset:


$$P(\mathbf{y} \mid \mathbf{X}) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}). \tag{3.1.15}$$


The equality follows since all pairs $(\mathbf{x}^{(i)}, y^{(i)})$ were drawn independently of each other.
Estimators chosen according to the principle of maximum likelihood are called maximum
likelihood estimators. While maximizing the product of many exponential functions might
look difficult, we can simplify things significantly, without changing the objective, by
maximizing the logarithm of the likelihood instead. For historical reasons, optimizations are
more often expressed as minimization rather than maximization. So, without changing
anything, we can minimize the negative log-likelihood, which we can express as follows:


$$-\log P(\mathbf{y} \mid \mathbf{X}) = \sum_{i=1}^n \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2. \tag{3.1.16}$$


If we assume that $\sigma$ is fixed, we can ignore the first term, because it does not depend on $\mathbf{w}$
or $b$. The second term is identical to the squared error loss introduced earlier, except for
the multiplicative constant $\frac{1}{\sigma^2}$. Fortunately, the solution does not depend on $\sigma$ either. It
follows that minimizing the mean squared error is equivalent to the maximum likelihood 
estimation of a linear model under the assumption of additive Gaussian noise. 
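A quick numerical check can make this equivalence tangible. For a fixed noise level, the negative log-likelihood is the average squared loss scaled by the positive constant n / sigma^2 plus a constant offset, so both are minimized by the same parameters. The sketch below verifies this on synthetic residuals; the sample size, the value of `sigma`, and the random residuals are assumptions made for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Synthetic residuals y - (w^T x + b) for n examples; sigma is treated as known
n, sigma = 100, 0.5
residuals = rng.normal(0.0, sigma, size=n)

# Negative log-likelihood under Gaussian noise, summed over the examples
nll = np.sum(0.5 * math.log(2 * math.pi * sigma**2)
             + residuals**2 / (2 * sigma**2))

# Average squared loss with the factor 1/2
L = np.mean(0.5 * residuals**2)

# The NLL is an affine function of L with positive slope n / sigma^2,
# so both objectives are minimized by the same (w, b)
nll_from_loss = 0.5 * n * math.log(2 * math.pi * sigma**2) + n / sigma**2 * L
```

The two quantities `nll` and `nll_from_loss` agree up to floating-point error, whatever residuals are plugged in.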


3.1.4 Linear Regression as a Neural Network 


While linear models are not sufficiently rich to express the many complicated networks 
that we will introduce in this book, (artificial) neural networks are rich enough to subsume 
linear models as networks in which every feature is represented by an input neuron, all of 
which are connected directly to the output. 


Fig. 3.1.2 depicts linear regression as a neural network. The diagram highlights the con- 
nectivity pattern, such as how each input is connected to the output, but not the specific 
values taken by the weights or biases. 


Fig. 3.1.2: Linear regression is a single-layer neural network. (Diagram: an input layer wired directly to an output layer.)


The inputs are $x_1, \ldots, x_d$. We refer to $d$ as the number of inputs or the feature dimensionality
in the input layer. The output of the network is $o_1$. Because we are just trying to predict
a single numerical value, we have only one output neuron. Note that the input values are all 
given. There is just a single computed neuron. In summary, we can think of linear regres- 
sion as a single-layer fully connected neural network. We will encounter networks with far 
more layers in later chapters. 
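In PyTorch, such a single-layer fully connected network corresponds to one `nn.Linear` layer with a single output. The following sketch, with an arbitrary choice of three inputs and a random batch, checks that the layer computes exactly the affine map used throughout this section.

```python
import torch
from torch import nn

torch.manual_seed(0)

d = 3                      # number of inputs (feature dimensionality)
net = nn.Linear(d, 1)      # d input neurons fully connected to one output neuron

X = torch.randn(5, d)      # a batch of 5 made-up examples
o = net(X)                 # shape (5, 1): one output per example

# nn.Linear stores the weights with shape (1, d) and the bias with shape (1,)
manual = X @ net.weight.T + net.bias
```

The tensors `o` and `manual` match, confirming that the only computed neuron is a weighted sum of the inputs plus a bias.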
Biology


Because linear regression predates computational neuroscience, it might seem anachro- 
nistic to describe linear regression in terms of neural networks. Nonetheless, they were a 
natural place to start when the cyberneticists and neurophysiologists Warren McCulloch 
and Walter Pitts began to develop models of artificial neurons. Consider the cartoonish 
picture of a biological neuron in Fig. 3.1.3, consisting of dendrites (input terminals), the 
nucleus (CPU), the axon (output wire), and the axon terminals (output terminals), enabling 
connections to other neurons via synapses. 


Fig. 3.1.3: The real neuron (source: “Anatomy and Physiology” by the US National Cancer
Institute’s Surveillance, Epidemiology and End Results (SEER) Program). Labeled parts:
dendrite, cell body, nucleus, myelin sheath, Schwann cell, node of Ranvier, axon terminal.


Information $x_i$ arriving from other neurons (or environmental sensors) is received in the
dendrites. In particular, that information is weighted by synaptic weights $w_i$, determining
the effect of the inputs, e.g., activation or inhibition via the product $x_i w_i$. The weighted
inputs arriving from multiple sources are aggregated in the nucleus as a weighted sum
$y = \sum_i x_i w_i + b$, possibly subject to some nonlinear postprocessing via a function $\sigma(y)$. This
information is then sent via the axon to the axon terminals, where it reaches its destination 
(e.g., an actuator such as a muscle) or it is fed into another neuron via its dendrites. 


Certainly, the high-level idea that many such units could be combined, provided they have 
the correct connectivity and learning algorithm, to produce far more interesting and com- 
plex behavior than any one neuron alone could express arises from our study of real bi- 
ological neural systems. At the same time, most research in deep learning today draws 
inspiration from a much wider source. We invoke Russell and Norvig (2016) who pointed 
out that although airplanes might have been inspired by birds, ornithology has not been 
the primary driver of aeronautics innovation for some centuries. Likewise, inspiration in 
deep learning these days comes in equal or greater measure from mathematics, linguistics, 
psychology, statistics, computer science, and many other fields. 


3.1.5 Summary 


In this section, we introduced traditional linear regression, where the parameters of a linear 
function are chosen to minimize squared loss on the training set. We also motivated this 
choice of objective both via some practical considerations and through an interpretation 
of linear regression as maximum likelihood estimation under an assumption of linearity
and Gaussian noise. After discussing both computational considerations and connections to
statistics, we showed how such linear models could be expressed as simple neural networks
where the inputs are directly wired to the output(s). While we will soon move past linear 
models altogether, they are sufficient to introduce most of the components that all of our 
models require: parametric forms, differentiable objectives, optimization via minibatch 
stochastic gradient descent, and ultimately, evaluation on previously unseen data. 


3.1.6 Exercises 


1. Assume that we have some data $x_1, \ldots, x_n \in \mathbb{R}$. Our goal is to find a constant $b$ such
that $\sum_i (x_i - b)^2$ is minimized.


1. Find an analytic solution for the optimal value of b. 
2. How does this problem and its solution relate to the normal distribution? 


3. What if we change the loss from $\sum_i (x_i - b)^2$ to $\sum_i |x_i - b|$? Can you find the optimal
solution for b? 


2. Prove that the affine functions that can be expressed by $\mathbf{x}^\top \mathbf{w} + b$ are equivalent to linear
functions on $(\mathbf{x}, 1)$.


3. Assume that you want to find quadratic functions of $\mathbf{x}$, i.e.,
$f(\mathbf{x}) = b + \sum_i w_i x_i + \sum_{j \leq i} w_{ij} x_i x_j$. How would you formulate this in a deep network?


4. Recall that one of the conditions for the linear regression problem to be solvable was 
that the design matrix $\mathbf{X}^\top \mathbf{X}$ has full rank.


1. What happens if this is not the case? 


2. How could you fix it? What happens if you add a small amount of coordinate-wise 
independent Gaussian noise to all entries of X? 


3. What is the expected value of the design matrix $\mathbf{X}^\top \mathbf{X}$ in this case?
4. What happens with stochastic gradient descent when $\mathbf{X}^\top \mathbf{X}$ does not have full rank?


5. Assume that the noise model governing the additive noise $\epsilon$ is the exponential distribution.
That is, $p(\epsilon) = \frac{1}{2} \exp(-|\epsilon|)$.


1. Write out the negative log-likelihood of the data under the model $-\log P(\mathbf{y} \mid \mathbf{X})$.
2. Can you find a closed form solution? 


3. Suggest a minibatch stochastic gradient descent algorithm to solve this problem. 
What could possibly go wrong (hint: what happens near the stationary point as we 
keep on updating the parameters)? Can you fix this? 


6. Assume that we want to design a neural network with two layers by composing two 
linear layers. That is, the output of the first layer becomes the input of the second layer. 
Why would such a naive composition not work? 


7. What happens if you want to use regression for realistic price estimation of houses or 
stock prices?


1. Show that the additive Gaussian noise assumption is not appropriate. Hint: can we
have negative prices? What about fluctuations? 


2. Why would regression to the logarithm of the price be much better, i.e., y = log price? 


3. What do you need to worry about when dealing with penny stock, i.e., stock with very
low prices? Hint: can you trade at all possible prices? Why is this a bigger problem 
for cheap stock? For more information review the celebrated Black-Scholes model 
for option pricing (Black and Scholes, 1973). 


8. Suppose we want to use regression to estimate the number of apples sold in a grocery 
store. 


1. What are the problems with a Gaussian additive noise model? Hint: you are selling 
apples, not oil. 


2. The Poisson distribution captures distributions over counts. It is given by
$p(k \mid \lambda) = \lambda^k e^{-\lambda}/k!$. Here $\lambda$ is the rate function and $k$ is the number of events you see.
Prove that $\lambda$ is the expected value of counts $k$.


3. Design a loss function associated with the Poisson distribution. 


4. Design a loss function for estimating $\log \lambda$ instead.




3.2 Object-Oriented Design for Implementation 


In our introduction to linear regression, we walked through various components including 
the data, the model, the loss function, and the optimization algorithm. Indeed, linear re- 
gression is one of the simplest machine learning models. Training it, however, uses many of 
the same components that other models in this book require. Therefore, before diving into 
the implementation details it is worth designing some of the APIs that we use throughout. 
Treating components in deep learning as objects, we can start by defining classes for these 
objects and their interactions. This object-oriented design for implementation will greatly 
streamline the presentation and you might even want to use it in your projects. 


Inspired by open-source libraries such as PyTorch Lightning, at a high level we wish
to have three classes: (i) Module contains models, losses, and optimization methods; (ii) 
DataModule provides data loaders for training and validation; (iii) both classes are com- 
bined using the Trainer class, which allows us to train models on a variety of hardware 
platforms. Most code in this book adapts Module and DataModule. We will touch upon 
the Trainer class only when we discuss GPUs, CPUs, parallel training, and optimization 
algorithms.


import time
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l


3.2.1 Utilities 


We need a few utilities to simplify object-oriented programming in Jupyter notebooks. One 
of the challenges is that class definitions tend to be fairly long blocks of code. Notebook 
readability demands short code fragments, interspersed with explanations, a requirement 
incompatible with the style of programming common for Python libraries. The first utility 
function allows us to register functions as methods in a class after the class has been created. 
In fact, we can do so even after we have created instances of the class! It allows us to split 
the implementation of a class into multiple code blocks. 


def add_to_class(Class):  #@save
    """Register functions as methods in created class."""
    def wrapper(obj):
        setattr(Class, obj.__name__, obj)
    return wrapper


Let’s have a quick look at how to use it. We plan to implement a class A with a method do. 
Instead of having code for both A and do in the same code block, we can first declare the 
class A and create an instance a. 


class A:
    def __init__(self):
        self.b = 1

a = A()


Next we define the method do as we normally would, but not in class A’s scope. Instead, 
we decorate this method by add_to_class with class A as its argument. In doing so, the 
method is able to access the member variables of A just as we would expect had it been 
included as part of A’s definition. Let’s see what happens when we invoke it for the instance 
a. 


@add_to_class(A)
def do(self):
    print('Class attribute "b" is', self.b)

a.do()


Class attribute "b" is 1


The second one is a utility class that saves all arguments in a class’s __init__ method
as class attributes. This allows us to extend constructor call signatures implicitly without
additional code.


class HyperParameters:  #@save
    """The base class of hyperparameters."""
    def save_hyperparameters(self, ignore=[]):
        # Placeholder: the full implementation is deferred to Section B.7
        raise NotImplementedError


We defer its implementation to Section B.7. To use it, we define our class that inherits
from HyperParameters and calls save_hyperparameters in the __init__ method.


# Call the fully implemented HyperParameters class saved in d2l
class B(d2l.HyperParameters):
    def __init__(self, a, b, c):
        self.save_hyperparameters(ignore=['c'])
        print('self.a =', self.a, 'self.b =', self.b)
        print('There is no self.c =', not hasattr(self, 'c'))

b = B(a=1, b=2, c=3)


self.a = 1 self.b = 2
There is no self.c = True
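To give a feel for how such a utility might work, here is one possible sketch that inspects the caller's stack frame. It is an assumption for illustration only: the class name `HyperParametersSketch`, the `hparams` dictionary, and the frame-inspection approach are ours and may differ from the implementation given in Section B.7.

```python
import inspect

class HyperParametersSketch:
    """Hypothetical stand-in for illustration; not the d2l implementation."""
    def save_hyperparameters(self, ignore=()):
        # Look up the local variables of the caller (expected to be __init__)
        frame = inspect.currentframe().f_back
        _, _, _, local_vars = inspect.getargvalues(frame)
        skip = set(ignore) | {'self'}
        self.hparams = {k: v for k, v in local_vars.items()
                        if k not in skip and not k.startswith('_')}
        # Expose every saved argument as an attribute
        for k, v in self.hparams.items():
            setattr(self, k, v)

class C(HyperParametersSketch):
    def __init__(self, a, b, c):
        self.save_hyperparameters(ignore=['c'])

obj = C(a=1, b=2, c=3)
```

After construction, `obj.a` and `obj.b` are available as attributes, while `c` is skipped because it was listed in `ignore`.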


The final utility allows us to plot experiment progress interactively while it is going on. 
In deference to the much more powerful (and complex) TensorBoard we name it ProgressBoard.
The implementation is deferred to Section B.7. For now, let’s simply see it
in action. 


The draw method plots a point (x, y) in the figure, with label specified in the legend. 
The optional every_n smooths the line by only showing 1/n points in the figure. Their 
values are averaged from the n neighbor points in the original figure. 
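
To make the smoothing rule concrete, here is a small sketch of the averaging step in plain Python. The function name smooth_every_n is hypothetical, and dropping an incomplete trailing group is an assumption of this sketch rather than a documented property of ProgressBoard.

```python
def smooth_every_n(points, every_n):
    """Average each run of `every_n` consecutive (x, y) points into one point."""
    smoothed, pending = [], []
    for point in points:
        pending.append(point)
        if len(pending) == every_n:
            mean_x = sum(p[0] for p in pending) / every_n
            mean_y = sum(p[1] for p in pending) / every_n
            smoothed.append((mean_x, mean_y))
            pending = []  # assumption: an incomplete trailing group is not emitted
    return smoothed


raw = [(0, 0.0), (1, 2.0), (2, 4.0), (3, 6.0), (4, 8.0)]
print(smooth_every_n(raw, 2))
```

With every_n=2, five raw points yield two plotted points; the fifth stays pending until a sixth arrives.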


class ProgressBoard(d2l.HyperParameters):  #@save
    """The board that plots data points in animation."""
    def __init__(self, xlabel=None, ylabel=None, xlim=None,
                 ylim=None, xscale='linear', yscale='linear',
                 ls=['-', '--', '-.', ':'], colors=['C0', 'C1', 'C2', 'C3'],
                 fig=None, axes=None, figsize=(3.5, 2.5), display=True):
        self.save_hyperparameters()

    def draw(self, x, y, label, every_n=1):
        raise NotImplementedError


In the following example, we draw sin and cos with a different smoothness. If you run this 
code block, you will see the lines grow in animation. 


board = d2l.ProgressBoard('x')
for x in np.arange(0, 10, 0.1):
    board.draw(x, np.sin(x), 'sin', every_n=2)
    board.draw(x, np.cos(x), 'cos', every_n=10)


3.2.2 Models


The Module class is the base class of all models we will implement. At the very least
we need three methods. The first, __init__, stores the learnable parameters, the
training_step method accepts a data batch to return the loss value, and finally,
configure_optimizers returns the optimization method, or a list of them, that is used to
update the learnable parameters. Optionally we can define validation_step to report the
evaluation measures. Sometimes we put the code for computing the output into a separate
forward method to make it more reusable.


class Module(nn.Module, d2l.HyperParameters):  #@save
    """The base class of models."""
    def __init__(self, plot_train_per_epoch=2, plot_valid_per_epoch=1):
        super().__init__()
        self.save_hyperparameters()
        self.board = ProgressBoard()

    def loss(self, y_hat, y):
        raise NotImplementedError

    def forward(self, X):
        assert hasattr(self, 'net'), 'Neural network is not defined'
        return self.net(X)

    def plot(self, key, value, train):
        """Plot a point in animation."""
        assert hasattr(self, 'trainer'), 'Trainer is not inited'
        self.board.xlabel = 'epoch'
        if train:
            x = self.trainer.train_batch_idx / \
                self.trainer.num_train_batches
            n = self.trainer.num_train_batches / \
                self.plot_train_per_epoch
        else:
            x = self.trainer.epoch + 1
            n = self.trainer.num_val_batches / \
                self.plot_valid_per_epoch
        self.board.draw(x, value.to(d2l.cpu()).detach().numpy(),
                        ('train_' if train else 'val_') + key,
                        every_n=int(n))

    def training_step(self, batch):
        l = self.loss(self(*batch[:-1]), batch[-1])
        self.plot('loss', l, train=True)
        return l

    def validation_step(self, batch):
        l = self.loss(self(*batch[:-1]), batch[-1])
        self.plot('loss', l, train=False)

    def configure_optimizers(self):
        raise NotImplementedError


You may notice that Module is a subclass of nn.Module, the base class of neural networks 
in PyTorch. It provides convenient features for handling neural networks. For example, if 
we define a forward method, such as forward(self, X), then for an instance a we can 
invoke this method by a(X). This works since it calls the forward method in the built-in 
__call__ method. You can find more details and examples about nn.Module in Section 
6.1. 
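
The dispatch from a(X) to forward can be illustrated without any framework. The sketch below uses a hypothetical class TinyModule that mimics only this one aspect of nn.Module. The real __call__ does more work (for instance, it runs registered hooks), which is omitted here.

```python
class TinyModule:
    """Hypothetical, heavily simplified mimic of the call dispatch in nn.Module."""

    def __call__(self, *args, **kwargs):
        # Calling the instance forwards all arguments to the forward method.
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError


class Doubler(TinyModule):
    def forward(self, X):
        return [2 * x for x in X]


a = Doubler()
print(a([1, 2, 3]))
```

Subclasses only override forward, while users of the model keep writing a(X).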


3.2.3 Data 


The DataModule class is the base class for data. Quite frequently the __init__ method is 
used to prepare the data. This includes downloading and preprocessing if needed. The 
train_dataloader returns the data loader for the training dataset. A data loader is a 
(Python) generator that yields a data batch each time it is used. This batch is then fed 
into the training_step method of Module to compute the loss. There is an optional 
val_dataloader to return the validation dataset loader. It behaves in the same manner, 
except that it yields data batches for the validation_step method in Module. 
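
As a small illustration of what "generator" means here, the following plain-Python sketch (the function name batches is hypothetical) yields minibatches lazily. It also shows a property worth keeping in mind: a generator object can be traversed only once, so a fresh one has to be created for every pass over the data.

```python
def batches(examples, batch_size):
    """Yield consecutive slices of `examples`, one minibatch at a time."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


loader = batches(list(range(10)), batch_size=4)
first_pass = list(loader)    # consumes the generator
second_pass = list(loader)   # nothing is left to yield
print(first_pass, second_pass)
```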


class DataModule(d2l.HyperParameters):  #@save
    """The base class of data."""
    def __init__(self, root='../data', num_workers=4):
        self.save_hyperparameters()

    def get_dataloader(self, train):
        raise NotImplementedError

    def train_dataloader(self):
        return self.get_dataloader(train=True)

    def val_dataloader(self):
        return self.get_dataloader(train=False)


3.2.4 Training 


The Trainer class trains the learnable parameters in the Module class with data specified
in DataModule. The key method is fit, which accepts two arguments: model, an instance
of Module, and data, an instance of DataModule. It then iterates over the entire dataset
max_epochs times to train the model. As before, we will defer the implementation of this
method to later chapters.


class Trainer(d2l.HyperParameters):  #@save
    """The base class for training models with data."""
    def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
        self.save_hyperparameters()
        assert num_gpus == 0, 'No GPU support yet'

    def prepare_data(self, data):
        self.train_dataloader = data.train_dataloader()
        self.val_dataloader = data.val_dataloader()
        self.num_train_batches = len(self.train_dataloader)
        self.num_val_batches = (len(self.val_dataloader)
                                if self.val_dataloader is not None else 0)

    def prepare_model(self, model):
        model.trainer = self
        model.board.xlim = [0, self.max_epochs]
        self.model = model

    def fit(self, model, data):
        self.prepare_data(data)
        self.prepare_model(model)
        self.optim = model.configure_optimizers()
        self.epoch = 0
        self.train_batch_idx = 0
        self.val_batch_idx = 0
        for self.epoch in range(self.max_epochs):
            self.fit_epoch()

    def fit_epoch(self):
        raise NotImplementedError


3.2.5 Summary 


To highlight the object-oriented design for our future deep learning implementation, the
above classes simply show how their objects store data and interact with each other. We
will keep enriching implementations of these classes, such as via @add_to_class, in the
rest of the book. Moreover, these fully implemented classes are saved in the D2L library,
a lightweight toolkit that makes structured modeling for deep learning easy. In particular,
it facilitates reusing many components between projects without changing much at all. For
instance, we can replace just the optimizer, just the model, just the dataset, etc.; this degree
of modularity pays dividends throughout the book in terms of conciseness and simplicity
(this is why we added it) and it can do the same for your own projects.


3.2.6 Exercises 


1. Locate full implementations of the above classes that are saved in the D2L library.
We strongly recommend that you look at the implementation in detail once you have
gained some more familiarity with deep learning modeling.


2. Remove the save_hyperparameters statement in the B class. Can you still print self.a
and self.b? Optional: if you have dived into the full implementation of the
HyperParameters class, can you explain why?




3.3 Synthetic Regression Data 


Machine learning is all about extracting information from data. So you might wonder, 
what could we possibly learn from synthetic data? While we might not care intrinsically 
about the patterns that we ourselves baked into an artificial data generating model, such 
datasets are nevertheless useful for didactic purposes, helping us to evaluate the properties 
of our learning algorithms and to confirm that our implementations work as expected. For 
example, if we create data for which the correct parameters are known a priori, then we 
can check that our model can in fact recover them. 


%matplotlib inline
import random
import torch
from d2l import torch as d2l


3.3.1 Generating the Dataset 


For this example, we will work in low dimension for succinctness. The following code
snippet generates 1000 examples with 2-dimensional features drawn from a standard normal
distribution. The resulting design matrix $\mathbf{X}$ belongs to $\mathbb{R}^{1000 \times 2}$.
We generate each label by applying a ground truth linear function, corrupting them via
additive noise $\boldsymbol{\epsilon}$, drawn independently and identically for each example:


$$\mathbf{y} = \mathbf{X}\mathbf{w} + b + \boldsymbol{\epsilon}. \qquad (3.3.1)$$


For convenience we assume that $\boldsymbol{\epsilon}$ is drawn from a normal distribution
with mean $\mu = 0$ and standard deviation $\sigma = 0.01$. Note that for object-oriented
design we add the code to the __init__ method of a subclass of d2l.DataModule (introduced
in Section 3.2.3). It is good practice to allow the setting of any additional
hyperparameters. We accomplish this with save_hyperparameters(). The batch_size will be
determined later.


class SyntheticRegressionData(d2l.DataModule):  #@save
    """Synthetic data for linear regression."""
    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
                 batch_size=32):
        super().__init__()
        self.save_hyperparameters()
        n = num_train + num_val
        self.X = torch.randn(n, len(w))
        noise = torch.randn(n, 1) * noise
        self.y = torch.matmul(self.X, w.reshape((-1, 1))) + b + noise


Below, we set the true parameters to $\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$. Later, we
can check our estimated parameters against these ground truth values.


data = SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2) 


Each row in features consists of a vector in $\mathbb{R}^2$ and each row in labels is a
scalar. Let’s have a look at the first entry.


print('features:', data.X[0], '\nlabel:', data.y[0])


features: tensor([0.9026, 1.0264]) 
label: tensor([2.5148]) 
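
As a sanity check on the printed example, we can recompute the noise-free label by hand from (3.3.1) with the ground truth parameters above. The snippet uses the rounded values exactly as printed, so the residual is only approximate; it should nevertheless be on the order of the noise level 0.01 or smaller.

```python
w = [2.0, -3.4]
b = 4.2
features = [0.9026, 1.0264]   # values as printed (rounded to four decimals)
label = 2.5148

# Noise-free prediction w^T x + b for the first example.
noise_free = sum(wi * xi for wi, xi in zip(w, features)) + b
residual = label - noise_free  # what remains is (approximately) the noise term
print(f'noise-free label: {noise_free:.4f}, residual: {residual:.4f}')
```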


3.3.2 Reading the Dataset 


Training machine learning models often requires multiple passes over a dataset, grabbing 
one minibatch of examples at a time. This data is then used to update the model. To 
illustrate how this works, we implement the get_dataloader method, registering it in 
the SyntheticRegressionData class via add_to_class (introduced in Section 3.2.1). It 
takes a batch size, a matrix of features, and a vector of labels, and generates minibatches of 
size batch_size. As such, each minibatch consists of a tuple of features and labels. Note 
that we need to be mindful of whether we’re in training or validation mode: in the former, 
we will want to read the data in random order, whereas for the latter, being able to read data 
in a pre-defined order may be important for debugging purposes. 


@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
    if train:
        indices = list(range(0, self.num_train))
        # The examples are read in random order
        random.shuffle(indices)
    else:
        indices = list(range(self.num_train, self.num_train+self.num_val))
    for i in range(0, len(indices), self.batch_size):
        batch_indices = torch.tensor(indices[i: i+self.batch_size])
        yield self.X[batch_indices], self.y[batch_indices]


To build some intuition, let’s inspect the first minibatch of data. Each minibatch of fea- 
tures provides us with both its size and the dimensionality of input features. Likewise, our 
minibatch of labels will have a matching shape given by batch_size. 




X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)


X shape: torch.Size([32, 2]) 
y shape: torch.Size([32, 1]) 


While seemingly innocuous, the invocation of iter(data.train_dataloader()) illustrates
the power of Python’s object-oriented design. Note that we added a method to the
SyntheticRegressionData class after creating the data object. Nonetheless, the object 
benefits from the ex post facto addition of functionality to the class. 
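
The mechanism behind this can be shown in a few lines of plain Python. The class Greeter and the function greet are hypothetical names chosen for illustration; the setattr call is in the spirit of what a decorator such as add_to_class does. Because methods are looked up on the class at call time, an instance created earlier sees the new method immediately.

```python
class Greeter:
    def __init__(self, name):
        self.name = name


g = Greeter('d2l')          # the instance exists before the method does


def greet(self):
    return 'hello, ' + self.name


setattr(Greeter, 'greet', greet)   # attach the function to the class afterwards
print(g.greet())
```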


Throughout the iteration we obtain distinct minibatches until the entire dataset has been 
exhausted (try this). While the iteration implemented above is good for didactic purposes, 
it is inefficient in ways that might get us into trouble with real problems. For example, it 
requires that we load all the data in memory and that we perform lots of random memory 
access. The built-in iterators implemented in a deep learning framework are considerably 
more efficient and they can deal with sources such as data stored in files, data received via 
a stream, and data generated or processed on the fly. Next let’s try to implement the same 
method using built-in iterators. 


3.3.3 Concise Implementation of the Data Loader 


Rather than writing our own iterator, we can call the existing API in a framework to load 
data. As before, we need a dataset with features X and labels y. Beyond that, we set 
batch_size in the built-in data loader and let it take care of shuffling examples effi- 
ciently. 


@d2l.add_to_class(d2l.DataModule)  #@save
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = tuple(a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size,
                                       shuffle=train)

@d2l.add_to_class(SyntheticRegressionData)  #@save
def get_dataloader(self, train):
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)


The new data loader behaves just like the previous one, except that it is more efficient and 
has some added functionality. 


X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)


X shape: torch.Size([32, 2])
y shape: torch.Size([32, 1])


For instance, the data loader provided by the framework API supports the built-in __len__ 
method, so we can query its length, i.e., the number of batches. 


len(data.train_dataloader())


32 
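
The value 32 follows from simple arithmetic on the defaults used above (1000 training examples, batch size 32), assuming the final, smaller batch is kept. The variable names below are chosen for illustration only.

```python
import math

num_train, batch_size = 1000, 32
num_batches = math.ceil(num_train / batch_size)              # final partial batch kept
last_batch_size = num_train - (num_batches - 1) * batch_size
num_full_batches = num_train // batch_size                   # count of full-size batches only
print(num_batches, last_batch_size, num_full_batches)
```

So 31 batches contain 32 examples each and the last one contains the remaining 8.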


3.3.4 Summary 


Data loaders are a convenient way of abstracting out the process of loading and manipu- 
lating data. This way the same machine learning algorithm is capable of processing many 
different types and sources of data without the need for modification. One of the nice things 
about data loaders is that they can be composed. For instance, we might be loading images 
and then have a postprocessing filter that crops them or modifies them in other ways. As 
such, data loaders can be used to describe an entire data processing pipeline. 
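
The idea of composition can be sketched with ordinary Python generators. All function names below are hypothetical and the stages are deliberately trivial: a source, a processing step, and a batching step are chained so that each stage consumes the previous one lazily.

```python
def load_numbers(n):
    """Stand-in for a raw data source (hypothetical)."""
    for i in range(n):
        yield float(i)


def normalize(stream, scale):
    """Processing stage: rescale each example."""
    for x in stream:
        yield x / scale


def batched(stream, batch_size):
    """Final stage: group examples into minibatches."""
    batch = []
    for x in stream:
        batch.append(x)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # emit the remaining partial batch, if any
        yield batch


pipeline = batched(normalize(load_numbers(5), scale=10.0), batch_size=2)
result = list(pipeline)
print(result)
```

Swapping or inserting a stage does not require touching the others, which is the property the paragraph above refers to.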


As for the model itself, the two-dimensional linear model is about the simplest we might 
encounter. It lets us test out the accuracy of regression models without worrying about 
having insufficient amounts of data or an underdetermined system of equations. We will 
put this to good use in the next section. 


3.3.5 Exercises 


1. What will happen if the number of examples cannot be divided by the batch size? How
would you change this behavior by specifying a different argument by using the
framework’s API?


2. Suppose that we want to generate a huge dataset, where both the size of the parameter 
vector w and the number of examples num_examples are large. 


   1. What happens if we cannot hold all data in memory?


   2. How would you shuffle the data if it is held on disk? Your task is to design an efficient
      algorithm that does not require too many random reads or writes. Hint: pseudorandom
      permutation generators allow you to design a reshuffle without the need to
      store the permutation table explicitly (Naor and Reingold, 1999).


3. Implement a data generator that produces new data on the fly, every time the iterator is 
called. 


4. How would you design a random data generator that generates the same data each time 
it is called? 






3.4 Linear Regression Implementation from Scratch 


We are now ready to work through a fully functioning implementation of linear regression. 
In this section, we will implement the entire method from scratch, including (i) the model; 
(ii) the loss function; (iii) a minibatch stochastic gradient descent optimizer; and (iv) the 
training function that stitches all of these pieces together. Finally, we will run our synthetic 
data generator from Section 3.3 and apply our model on the resulting dataset. While modern 
deep learning frameworks can automate nearly all of this work, implementing things from 
scratch is the only way to make sure that you really know what you are doing. Moreover, 
when it is time to customize models, defining our own layers or loss functions, understand- 
ing how things work under the hood will prove handy. In this section, we will rely only 
on tensors and automatic differentiation. Later, we will introduce a more concise imple- 
mentation, taking advantage of the bells and whistles of deep learning frameworks while 
retaining the structure of what follows below. 


%matplotlib inline
import torch
from d2l import torch as d2l


3.4.1 Defining the Model 


Before we can begin optimizing our model’s parameters by minibatch SGD, we need to 
have some parameters in the first place. In the following we initialize weights by drawing 
random numbers from a normal distribution with mean 0 and a standard deviation of 0.01. 
The magic number 0.01 often works well in practice, but you can specify a different value 
through the argument sigma. Moreover we set the bias to 0. Note that for object-oriented 
design we add the code to the __init__ method of a subclass of d2l.Module (introduced
in Section 3.2.2). 


class LinearRegressionScratch(d2l.Module):  #@save
    """The linear regression model implemented from scratch."""
    def __init__(self, num_inputs, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.w = torch.normal(0, sigma, (num_inputs, 1), requires_grad=True)
        self.b = torch.zeros(1, requires_grad=True)


Next we must define our model, relating its input and parameters to its output. Using the
same notation as (3.1.4) for our linear model we simply take the matrix-vector product of
the input features X and the model weights w, and add the offset b to each example. The
product Xw is a vector and b is a scalar. Because of the broadcasting mechanism (see
Section 2.1.4), when we add a vector and a scalar, the scalar is added to each component of
the vector. The resulting forward method is registered in the LinearRegressionScratch
class via add_to_class (introduced in Section 3.2.1).


@d2l.add_to_class(LinearRegressionScratch)  #@save
def forward(self, X):
    return torch.matmul(X, self.w) + self.b
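
To see the arithmetic without tensors, here is a plain-Python sketch of the same computation, with nested lists standing in for the design matrix. The helper below is an illustration only and not part of d2l. The scalar bias is added to every row's dot product, which is exactly what broadcasting does in the tensor version.

```python
def forward(X, w, b):
    """Return Xw + b for a list-of-rows X, a weight list w, and a scalar bias b."""
    return [sum(x_j * w_j for x_j, w_j in zip(row, w)) + b for row in X]


X = [[1.0, 2.0], [3.0, 4.0]]
y_hat = forward(X, w=[2.0, -3.4], b=4.2)
print(y_hat)   # one prediction per row of X
```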


3.4.2 Defining the Loss Function 


Since updating our model requires taking the gradient of our loss function, we ought to
define the loss function first. Here we use the squared loss function in (3.1.5). In the
implementation, the true values y must have the same shape as the predicted values y_hat
so that the difference is taken elementwise; the labels generated in Section 3.3 already
satisfy this. The following method returns the loss averaged over all examples in the
minibatch.


@d2l.add_to_class(LinearRegressionScratch)  #@save
def loss(self, y_hat, y):
    l = (y_hat - y) ** 2 / 2
    return l.mean()
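
A tiny worked example, written in plain Python with made-up numbers, shows what this loss evaluates to. For predictions (1, 2, 3) and targets (1, 1, 1) the per-example losses are 0, 0.5, and 2, so the minibatch average is 2.5/3.

```python
def squared_loss(y_hat, y):
    """Mean over the minibatch of (y_hat - y)^2 / 2."""
    per_example = [(p - t) ** 2 / 2 for p, t in zip(y_hat, y)]
    return sum(per_example) / len(per_example)


print(squared_loss([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))
```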


3.4.3 Defining the Optimization Algorithm 


As discussed in Section 3.1, linear regression has a closed-form solution. However, our 
goal here is to illustrate how to train more general neural networks, and that requires that 
we teach you how to use minibatch SGD. Hence we will take this opportunity to introduce 
your first working example of SGD. At each step, using a minibatch randomly drawn from 
our dataset, we estimate the gradient of the loss with respect to the parameters. Next, we 
update the parameters in the direction that may reduce the loss. 


The following code applies the update, given a set of parameters and a learning rate lr. Since
our loss is computed as an average over the minibatch, we do not need to adjust the learning 
rate against the batch size. In later chapters we will investigate how learning rates should 
be adjusted for very large minibatches as they arise in distributed large-scale learning. For 
now, we can ignore this dependency. 
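
The claim about averaging can be checked numerically. The sketch below uses a hypothetical one-dimensional model y_hat = w * x and made-up data; it compares the gradient of the averaged loss with the gradient of the summed loss when the same examples are repeated to form a batch four times as large.

```python
def grad_w(xs, ys, w, reduce):
    """Gradient with respect to w of the squared loss for the model y_hat = w * x."""
    per_example = [(w * x - y) * x for x, y in zip(xs, ys)]
    total = sum(per_example)
    return total / len(per_example) if reduce == 'mean' else total


xs, ys, w = [1.0, 2.0], [3.0, 5.0], 0.5
small_mean = grad_w(xs, ys, w, 'mean')
large_mean = grad_w(xs * 4, ys * 4, w, 'mean')   # same examples, batch 4x larger
small_sum = grad_w(xs, ys, w, 'sum')
large_sum = grad_w(xs * 4, ys * 4, w, 'sum')
print(small_mean, large_mean, small_sum, large_sum)
```

The averaged gradient is unchanged, whereas the summed gradient grows with the batch size and would require rescaling the learning rate.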


We define our SGD class, a subclass of d2l.HyperParameters (introduced in Section 3.2.1),
to have a similar API as the built-in SGD optimizer. We update the parameters in the step
method. The zero_grad method sets all gradients to 0, which must be run before a
backpropagation step.


class SGD(d2l.HyperParameters):  #@save
    """Minibatch stochastic gradient descent."""
    def __init__(self, params, lr):
        self.save_hyperparameters()

    def step(self):
        for param in self.params:
            param -= self.lr * param.grad

    def zero_grad(self):
        for param in self.params:
            if param.grad is not None:
                param.grad.zero_()


We next define the configure_optimizers method, which returns an instance of the SGD 
class. 


@d2l.add_to_class(LinearRegressionScratch)  #@save
def configure_optimizers(self):
    return SGD([self.w, self.b], self.lr)


3.4.4 Training 


Now that we have all of the parts in place (parameters, loss function, model, and optimizer), 
we are ready to implement the main training loop. It is crucial that you understand this 
code fully since you will employ similar training loops for every other deep learning model 
covered in this book. In each epoch, we iterate through the entire training dataset, passing 
once through every example (assuming that the number of examples is divisible by the 
batch size). In each iteration, we grab a minibatch of training examples, and compute its 
loss through the model’s training_step method. Then we compute the gradients with 
respect to each parameter. Finally, we will call the optimization algorithm to update the 
model parameters. In summary, we will execute the following loop: 


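• Initialize parameters $(\mathbf{w}, b)$

• Repeat until done
  - Compute gradient $\mathbf{g} \leftarrow \partial_{(\mathbf{w}, b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
  - Update parameters $(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}$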
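
The loop above can be written out end to end in plain Python, with the gradient of the squared loss derived by hand instead of obtained by automatic differentiation. This is only a sketch under stated assumptions: the function name, the zero initialization, the fixed seeds, and the choice of 20 epochs are all illustrative and are not taken from the d2l Trainer.

```python
import random


def sgd_linear_regression(X, y, lr=0.03, batch_size=32, max_epochs=20, seed=0):
    """Minibatch SGD for y ~ Xw + b with squared loss (hand-derived gradients)."""
    rng = random.Random(seed)
    d = len(X[0])
    w, b = [0.0] * d, 0.0                      # initialize parameters (w, b)
    indices = list(range(len(X)))
    for _ in range(max_epochs):                # repeat until done
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w, grad_b = [0.0] * d, 0.0
            for i in batch:                    # gradient averaged over the minibatch
                err = sum(wj * xj for wj, xj in zip(w, X[i])) + b - y[i]
                for j in range(d):
                    grad_w[j] += err * X[i][j] / len(batch)
                grad_b += err / len(batch)
            w = [wj - lr * gj for wj, gj in zip(w, grad_w)]   # update step
            b -= lr * grad_b
    return w, b


# Synthetic data following y = Xw + b + noise with w = [2, -3.4], b = 4.2.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(1000)]
y = [2 * x0 - 3.4 * x1 + 4.2 + data_rng.gauss(0, 0.01) for x0, x1 in X]
w_hat, b_hat = sgd_linear_regression(X, y)
print(w_hat, b_hat)
```

With enough passes over the data the estimates approach the ground truth values used to generate it.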


Recall that the synthetic regression dataset that we generated in Section 3.3 sets aside
num_val examples for validation. In most cases we will want such a validation dataset to
measure our model quality. Here we pass the validation dataloader once in each epoch to
measure the model performance. Following our object-oriented design, the prepare_batch
and fit_epoch methods are registered in the d2l.Trainer class (introduced in Section
3.2.4).


@d2l.add_to_class(d2l.Trainer)  #@save
def prepare_batch(self, batch):
    return batch

@d2l.add_to_class(d2l.Trainer)  #@save
def fit_epoch(self):
    self.model.train()
    for batch in self.train_dataloader:
        loss = self.model.training_step(self.prepare_batch(batch))
        self.optim.zero_grad()
        with torch.no_grad():
            loss.backward()
            if self.gradient_clip_val > 0:  # To be discussed later
                self.clip_gradients(self.gradient_clip_val, self.model)
            self.optim.step()
        self.train_batch_idx += 1
    if self.val_dataloader is None:
        return
    self.model.eval()
    for batch in self.val_dataloader:
        with torch.no_grad():
            self.model.validation_step(self.prepare_batch(batch))
        self.val_batch_idx += 1


We are almost ready to train the model, but first we need some training data. Here we use
the SyntheticRegressionData class and pass in some ground truth parameters. Then
we train our model with the learning rate lr=0.03 and set max_epochs=3. Note that, in
general, both the number of epochs and the learning rate are hyperparameters. Setting
hyperparameters is tricky and we will usually want to use a three-way split: one
set for training, a second for hyperparameter selection, and the third reserved for the final
evaluation. We elide these details for now but will revisit them later.
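
A three-way split can be sketched as a partition of example indices. The function name and the 80/10/10 proportions below are assumptions made for illustration; the text does not prescribe particular fractions.

```python
import random


def three_way_split(num_examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Partition range(num_examples) into disjoint train/validation/test index lists."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    num_val = int(num_examples * val_frac)
    num_test = int(num_examples * test_frac)
    val = indices[:num_val]                       # for hyperparameter selection
    test = indices[num_val:num_val + num_test]    # reserved for the final evaluation
    train = indices[num_val + num_test:]          # used to fit the parameters
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```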


model = LinearRegressionScratch(2, lr=0.03)
data = d2l.SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)
trainer = d2l.Trainer(max_epochs=3)
trainer.fit(model, data)


[Figure: train_loss and val_loss curves plotted against epoch.]


Because we synthesized the dataset ourselves, we know precisely what the true parameters 
are. Thus, we can evaluate our success in training by comparing the true parameters with 
those that we learned through our training loop. Indeed they turn out to be very close to 
each other. 


with torch.no_grad():
    print(f'error in estimating w: {data.w - model.w.reshape(data.w.shape)}')
    print(f'error in estimating b: {data.b - model.b}')


error in estimating w: tensor([ 0.1408, -0.1493])
error in estimating b: tensor([0.2130])


We should not take the ability to exactly recover the ground truth parameters for granted. 
In general, for deep models unique solutions for the parameters do not exist, and even 
for linear models, exactly recovering the parameters is only possible when no feature is 
linearly dependent on the others. However, in machine learning, we are often less concerned 
with recovering true underlying parameters, but rather with parameters that lead to highly 
accurate prediction (Vapnik, 1992). Fortunately, even on difficult optimization problems, 
stochastic gradient descent can often find remarkably good solutions, owing partly to the 
fact that, for deep networks, there exist many configurations of the parameters that lead to 
highly accurate prediction. 
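
The remark about linearly dependent features can be made concrete with a toy example in plain Python. The data and the helper name predict are made up for illustration. When the second feature is an exact copy of the first, only the sum of the two weights affects the predictions, so different weight vectors are indistinguishable from the data.

```python
def predict(X, w, b):
    """Return Xw + b for a list-of-rows X."""
    return [sum(xj * wj for xj, wj in zip(row, w)) + b for row in X]


# The second feature duplicates the first, so only w[0] + w[1] matters.
X = [[1.0, 1.0], [2.0, 2.0], [-3.0, -3.0]]
pred_a = predict(X, w=[2.0, 0.0], b=4.0)
pred_b = predict(X, w=[0.5, 1.5], b=4.0)
print(pred_a == pred_b)
```

Both parameter settings predict equally well, which is why accurate prediction does not imply that the "true" parameters were recovered.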


3.4.5 Summary 


In this section, we took a significant step towards designing deep learning systems by im- 
plementing a fully functional neural network model and training loop. In this process, we 
built a data loader, a model, a loss function, an optimization procedure, and a visualization 
and monitoring tool. We did this by composing a Python object that contains all relevant 
components for training a model. While this is not yet a professional-grade implementation,
it is perfectly functional and code like this could already help you to solve small problems
quickly. In the coming sections, we will see how to do this both more concisely (avoiding 
boilerplate code) and more efficiently (using our GPUs to their full potential). 


3.4.6 Exercises 


1. What would happen if we were to initialize the weights to zero? Would the algorithm
still work? What if we initialized the parameters with variance 1000 rather than 0.01? 


2. Assume that you are Georg Simon Ohm trying to come up with a model for resistance
that relates voltage and current. Can you use automatic differentiation to learn the
parameters of your model?


3. Can you use Planck’s Law to determine the temperature of an object using spectral
energy density? For reference, the spectral density $B$ of radiation emanating from a
black body is $B(\lambda, T) = \frac{2 h c^2}{\lambda^5} \cdot \left(\exp \frac{h c}{\lambda k T} - 1\right)^{-1}$.
Here $\lambda$ is the wavelength, $T$ is the temperature, $c$ is the speed of light, $h$ is
Planck’s constant, and $k$ is the Boltzmann constant. You measure the energy for different
wavelengths $\lambda$ and you now need to fit the spectral density curve to Planck’s law.


4. What are the problems you might encounter if you wanted to compute the second deriva- 
tives of the loss? How would you fix them? 


5. Why is the reshape method needed in the loss function? 


6. Experiment using different learning rates to find out how quickly the loss function value
drops. Can you reduce the error by increasing the number of epochs of training?


7. If the number of examples cannot be divided by the batch size, what happens to data_iter
at the end of an epoch?


8. Try implementing a different loss function, such as the absolute value loss (y_hat -
d2l.reshape(y, y_hat.shape)).abs().sum().


   1. Check what happens for regular data.


   2. Check whether there is a difference in behavior if you actively perturb some entries,
      such as $y_5 = 10000$, of $\mathbf{y}$.


   3. Can you think of a cheap solution for combining the best aspects of squared loss and
      absolute value loss? Hint: how can you avoid really large gradient values?


9. Why do we need to reshuffle the dataset? Can you design a case where a maliciously 
constructed dataset would break the optimization algorithm otherwise? 




3.5 Concise Implementation of Linear Regression 


Deep learning has witnessed a sort of Cambrian explosion over the past decade. The sheer 
number of techniques, applications and algorithms by far surpasses the progress of pre- 
vious decades. This is due to a fortuitous combination of multiple factors, one of which 
is the powerful free tools offered by a number of open-source deep learning frameworks. 
Theano (Bergstra et al., 2010), DistBelief (Dean et al., 2012), and Caffe (Jia et al., 2014) 
arguably represent the first generation of such frameworks that found widespread adoption.
In contrast to earlier (seminal) works like SN2 (Simulateur Neuristique) (Bottou and Le 
Cun, 1988), which provided a Lisp-like programming experience, modern frameworks of- 
fer automatic differentiation and the convenience of Python. These frameworks allow us 
to automate and modularize the repetitive work of implementing gradient-based learning 
algorithms. 


In Section 3.4, we relied only on (i) tensors for data storage and linear algebra; and (ii) 
automatic differentiation for calculating gradients. In practice, because data iterators, loss 
functions, optimizers, and neural network layers are so common, modern libraries imple- 
ment these components for us as well. In this section, we will show you how to implement 
the linear regression model from Section 3.4 concisely by using high-level APIs of deep 
learning frameworks. 


import numpy as np
import torch
from torch import nn
from d2l import torch as d2l


3.5.1 Defining the Model


When we implemented linear regression from scratch in Section 3.4, we defined our model 
parameters explicitly and coded up the calculations to produce output using basic linear 
algebra operations. You should know how to do this. But once your models get more 
complex, and once you have to do this nearly every day, you will be glad of the assistance. 
The situation is similar to coding up your own blog from scratch. Doing it once or twice 
is rewarding and instructive, but you would be a lousy web developer if you spent a month 
reinventing the wheel. 


For standard operations, we can use a framework’s predefined layers, which allow us to 
focus on the layers used to construct the model rather than worrying about their implemen- 
tation. Recall the architecture of a single-layer network as described in Fig. 3.1.2. The 
layer is called fully connected, since each of its inputs is connected to each of its outputs 
by means of a matrix-vector multiplication.


In PyTorch, the fully connected layer is defined in Linear and LazyLinear classes (avail- 
able since version 1.8.0). The latter allows users to specify merely the output dimension, 
while the former additionally asks for how many inputs go into this layer. Specifying input 
shapes is inconvenient and may require nontrivial calculations (such as in convolutional 
layers). Thus, for simplicity, we will use such “lazy” layers whenever we can. 


class LinearRegression(d2l.Module):  #@save
    """The linear regression model implemented with high-level APIs."""
    def __init__(self, lr):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.LazyLinear(1)
        self.net.weight.data.normal_(0, 0.01)
        self.net.bias.data.fill_(0)


In the forward method we just invoke the built-in __call__ method of the predefined 
layers to compute the outputs. 


@d2l.add_to_class(LinearRegression)  #@save
def forward(self, X):
    return self.net(X)


3.5.2 Defining the Loss Function 


The MSELoss class computes the mean squared error (without the 1/2 factor in (3.1.5)). 
By default, MSELoss returns the average loss over examples. It is faster (and easier to use) 
than implementing our own. 
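
The effect of the missing 1/2 factor is easy to verify by hand. The two helper functions below are plain-Python illustrations with made-up inputs and do not call the framework: the mean squared error is exactly twice the averaged half-squared loss from Section 3.4, and consequently so are its gradients.

```python
def half_squared_loss(y_hat, y):
    """Average of (y_hat - y)^2 / 2, as in the from-scratch implementation."""
    return sum((p - t) ** 2 / 2 for p, t in zip(y_hat, y)) / len(y)


def mean_squared_error(y_hat, y):
    """Average of (y_hat - y)^2, i.e. without the 1/2 factor."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)


y_hat, y = [1.0, 2.0, 3.0], [1.0, 1.0, 1.0]
half = half_squared_loss(y_hat, y)
mse = mean_squared_error(y_hat, y)
print(half, mse)
```

One practical consequence, under the same learning rate, is that each update step is twice as large with the mean squared error as with the half-squared loss.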


@d2l.add_to_class(LinearRegression)  #@save
def loss(self, y_hat, y):
    fn = nn.MSELoss()
    return fn(y_hat, y)


3.5.3 Defining the Optimization Algorithm


Minibatch SGD is a standard tool for optimizing neural networks and thus PyTorch 
supports it alongside a number of variations on this algorithm in the optim module. When we 
instantiate an SGD instance, we specify the parameters to optimize over, obtainable from 
our model via self.parameters(), and the learning rate (self.lr) required by our 
optimization algorithm. 


@d2l.add_to_class(LinearRegression)  #@save
def configure_optimizers(self):
    return torch.optim.SGD(self.parameters(), self.lr)


3.5.4 Training 


You might have noticed that expressing our model through high-level APIs of a deep learn- 
ing framework requires fewer lines of code. We did not have to allocate parameters indi- 
vidually, define our loss function, or implement minibatch SGD. Once we start working 
with much more complex models, the advantages of the high-level API will grow consid- 
erably. 


Now that we have all the basic pieces in place, the training loop itself is the same as the 
one we implemented from scratch. So we just call the fit method (introduced in Section 
3.2.4), which relies on the implementation of the fit_epoch method in Section 3.4, to train 
our model. 


model = LinearRegression(lr=0.03)
data = d2l.SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)
trainer = d2l.Trainer(max_epochs=3)
trainer.fit(model, data)


[Figure: train_loss and val_loss curves.]


Below, we compare the model parameters learned by training on finite data and the actual 
parameters that generated our dataset. To access parameters, we access the weights and bias 
of the layer that we need. As in our implementation from scratch, note that our estimated 
parameters are close to their true counterparts. 


@d2l.add_to_class(LinearRegression)  #@save
def get_w_b(self):
    return (self.net.weight.data, self.net.bias.data)

w, b = model.get_w_b()

print(f'error in estimating w: {data.w - w.reshape(data.w.shape)}')
print(f'error in estimating b: {data.b - b}')


error in estimating w: tensor([ 0.0094, -0.0030])
error in estimating b: tensor([0.0137])


3.5.5 Summary 


This section contains the first implementation of a deep network (in this book) to tap into 
the conveniences afforded by modern deep learning frameworks, such as MXNet (Chen 
et al., 2015), JAX (Frostig et al., 2018), PyTorch (Paszke et al., 2019), and TensorFlow 
(Abadi et al., 2016). We used framework defaults for loading data, defining a layer, a loss 
function, an optimizer and a training loop. Whenever the framework provides all necessary 
features, it is generally a good idea to use them, since the library implementations of these 
components tend to be heavily optimized for performance and properly tested for reliability. 
At the same time, try not to forget that these modules can be implemented directly. This is 
especially important for aspiring researchers who wish to live on the leading edge of model 
development, where you will be inventing new components that cannot possibly exist in 
any current library. 


In PyTorch, the data module provides tools for data processing, and the nn module defines a 
large number of neural network layers and common loss functions. We can initialize the 
parameters by replacing their values with methods ending with _. Note that a layer such as 
nn.Linear requires us to specify its input dimension, whereas a lazy layer such as 
nn.LazyLinear asks only for the output dimension. While specifying input dimensions is 
trivial for now, it can have significant knock-on effects when we want to design complex 
networks with many layers. Careful consideration of how to parametrize these networks is 
needed to allow portability. 


3.5.6 Exercises 


1. How would you need to change the learning rate if you replace the aggregate loss over 
the minibatch with an average over the loss on the minibatch? 


2. Review the framework documentation to see which loss functions are provided. In par- 
ticular, replace the squared loss with Huber’s robust loss function. That is, use the loss 
function 


$$l(y, y') = \begin{cases} |y - y'| - \frac{\sigma}{2} & \text{if } |y - y'| > \sigma \\ \frac{1}{2\sigma} (y - y')^2 & \text{otherwise} \end{cases} \tag{3.5.1}$$

3. How do you access the gradient of the weights of the model? 


4. What is the effect on the solution if you change the learning rate and the number of 
epochs? Does it keep on improving? 


5. How does the solution change as you vary the amount of data generated? 


1. Plot the estimation error for $\hat{\mathbf{w}} - \mathbf{w}$ and $\hat{b} - b$ as a function of the amount of data. 
Hint: increase the amount of data logarithmically rather than linearly, i.e., 5, 10, 20, 
50, ..., 10,000 rather than 1000, 2000, ..., 10,000. 


2. Why is the suggestion in the hint appropriate? 




3.6 Generalization 


Consider two college students diligently preparing for their final exam. Commonly, this 
preparation will consist of practicing and testing their abilities by taking exams adminis- 
tered in previous years. Nonetheless, doing well on past exams is no guarantee that they will 
excel when it matters. For instance, imagine a student, Extraordinary Ellie, whose prepara- 
tion consisted entirely of memorizing the answers to previous years’ exam questions. Even 
if Ellie were endowed with an extraordinary memory, and thus could perfectly recall the an- 
swer to any previously seen question, she might nevertheless freeze when faced with a new 
(previously unseen) question. By comparison, imagine another student, Inductive Irene, 
with comparably poor memorization skills, but a knack for picking up patterns. Note that 
if the exam truly consisted of recycled questions from a previous year, Ellie would handily 
outperform Irene. Even if Irene’s inferred patterns yielded 90% accurate predictions, they 
could never compete with Ellie’s 100% recall. However, even if the exam consisted entirely 
of fresh questions, Irene might maintain her 90% average. 


As machine learning scientists, our goal is to discover patterns. But how can we be sure that 
we have truly discovered a general pattern and not simply memorized our data? Most of the 
time, our predictions are only useful if our model discovers such a pattern. We do not want 
to predict yesterday’s stock prices, but tomorrow’s. We do not need to recognize already 
diagnosed diseases for previously seen patients, but rather previously undiagnosed ailments 
in previously unseen patients. This problem—how to discover patterns that generalize—is 
the fundamental problem of machine learning, and arguably of all of statistics. We might 
cast this problem as just one slice of a far grander question that engulfs all of science: 
when are we ever justified in making the leap from particular observations to more general 
statements? 


In real life, we must fit our models using a finite collection of data. The typical scales 
of that data vary wildly across domains. For many important medical problems, we can 
only access a few thousand data points. When studying rare diseases, we might be lucky to 
access hundreds. By contrast, the largest public datasets consisting of labeled photographs, 
e.g., ImageNet (Deng et al., 2009), contain millions of images. And some unlabeled image 
collections such as the Flickr YFCC100M dataset can be even larger, containing over 100 
million images (Thomee et al., 2016). However, even at this extreme scale, the number of 
available data points remains infinitesimally small compared to the space of all possible 
images at a megapixel resolution. Whenever we work with finite samples, we must keep in 
mind the risk that we might fit our training data, only to find that we failed to discover 
a generalizable pattern. 


The phenomenon of fitting closer to our training data than to the underlying distribution is 
called overfitting, and techniques for combatting overfitting are often called regularization 
methods. While it is no substitute for a proper introduction to statistical learning theory 
(see Boucheron et al. (2005), Vapnik (1998)), we will give you just enough intuition to get 
going. We will revisit generalization in many chapters throughout the book, exploring both 
what is known about the principles underlying generalization in various models, and also 
heuristic techniques that have been found (empirically) to yield improved generalization on 
tasks of practical interest. 


3.6.1 Training Error and Generalization Error 


In the standard supervised learning setting, we assume that the training data and the test 
data are drawn independently from identical distributions. This is commonly called the 
IID assumption. While this assumption is strong, it is worth noting that, absent any such 
assumption, we would be dead in the water. Why should we believe that training data 
sampled from distribution P(X, Y) should tell us how to make predictions on test data 
generated by a different distribution Q(X, Y)? Making such leaps turns out to require strong 
assumptions about how P and Q are related. Later on we will discuss some assumptions 
that allow for shifts in distribution but first we need to understand the IID case, where 
$P(\cdot) = Q(\cdot)$. 


To begin with, we need to differentiate between the training error $R_\textrm{emp}$, which is a statistic 
calculated on the training dataset, and the generalization error R, which is an expectation 
taken with respect to the underlying distribution. You can think of the generalization error 
as what you would see if you applied your model to an infinite stream of additional data 
examples drawn from the same underlying data distribution. Formally the training error is 
expressed as a sum (with the same notation as Section 3.1): 


$$R_\textrm{emp}[\mathbf{X}, \mathbf{y}, f] = \frac{1}{n} \sum_{i=1}^n l(\mathbf{x}^{(i)}, y^{(i)}, f(\mathbf{x}^{(i)})), \tag{3.6.1}$$


while the generalization error is expressed as an integral: 


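$$R[p, f] = E_{(\mathbf{x}, y) \sim P} [l(\mathbf{x}, y, f(\mathbf{x}))] = \int \int l(\mathbf{x}, y, f(\mathbf{x})) p(\mathbf{x}, y) \; d\mathbf{x} \, dy. \tag{3.6.2}$$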


Problematically, we can never calculate the generalization error R exactly. Nobody ever 
tells us the precise form of the density function p(x, y). Moreover, we cannot sample 
an infinite stream of data points. Thus, in practice, we must estimate the generalization 
error by applying our model to an independent test set constituted of a random selection 
of examples $\mathbf{X}'$ and labels $\mathbf{y}'$ that were withheld from our training set. This consists of 
applying the same formula that was used for calculating the empirical training error but to 
a test set $\mathbf{X}', \mathbf{y}'$. 


Crucially, when we evaluate our classifier on the test set, we are working with a fixed 
classifier (it does not depend on the sample of the test set), and thus estimating its error is simply 
the problem of mean estimation. However, the same cannot be said for the training set. Note 
that the model we wind up with depends explicitly on the selection of the training set and 
thus the training error will in general be a biased estimate of the true error on the 
underlying population. The central question of generalization is then: when should we expect our 
training error to be close to the population error (and thus the generalization error)? 
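

The distinction between the two quantities can be sketched numerically. The following pure-Python example is only an illustration under stated assumptions: the data-generating process (y = 2x + 1 plus Gaussian noise), the sample sizes, and the helper names make_data, fit_line, and empirical_risk are all hypothetical. It fits a line on a small training set and then applies the same empirical-risk formula to a large held-out sample, which serves as a stand-in for the generalization error.

```python
import random

random.seed(0)


def make_data(n):
    """Draw n IID examples from y = 2x + 1 + Gaussian noise (assumed)."""
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [2 * x + 1 + random.gauss(0, 0.1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for a single feature, in closed form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w = sxy / sxx
    return w, my - w * mx


def empirical_risk(w, b, xs, ys):
    """Average loss as in (3.6.1), with l = (f(x) - y)^2 / 2."""
    return sum(0.5 * (w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


x_train, y_train = make_data(20)
x_test, y_test = make_data(10000)  # large held-out sample
w, b = fit_line(x_train, y_train)
train_err = empirical_risk(w, b, x_train, y_train)
test_err = empirical_risk(w, b, x_test, y_test)
print(f'train: {train_err:.5f}, test: {test_err:.5f}')
```

Because the fitted parameters were chosen to minimize the loss on the training sample, the training error typically comes out somewhat lower than the held-out error, although this is not guaranteed for every random draw.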


Model Complexity 


In classical theory, when we have simple models and abundant data, the training and gen- 
eralization errors tend to be close. However, when we work with more complex models 
and/or fewer examples, we expect the training error to go down but the generalization gap 
to grow. This should not be surprising. Imagine a model class so expressive that for any 
dataset of n examples, we can find a set of parameters that can perfectly fit arbitrary labels, 
even if randomly assigned. In this case, even if we fit our training data perfectly, how can 
we conclude anything about the generalization error? For all we know, our generalization 
error might be no better than random guessing. 
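

The thought experiment above can be sketched with the most expressive model imaginable, a lookup table that memorizes its training set. This is a toy illustration, not a method from this book: the LookupTable class, the integer inputs, and the coin-flip labels are assumptions made for the example.

```python
import random

random.seed(0)


class LookupTable:
    """A 'model' that memorizes every training example it has seen."""

    def fit(self, xs, ys):
        self.table = dict(zip(xs, ys))
        return self

    def predict(self, x):
        # Inputs never seen during training get a fixed default guess.
        return self.table.get(x, 0)


def error_rate(model, xs, ys):
    return sum(model.predict(x) != y for x, y in zip(xs, ys)) / len(xs)


# Inputs are distinct integers; labels are coin flips, so there is no
# pattern to discover.
n = 1000
xs = list(range(2 * n))
ys = [random.randint(0, 1) for _ in xs]
train_x, train_y = xs[:n], ys[:n]
test_x, test_y = xs[n:], ys[n:]

model = LookupTable().fit(train_x, train_y)
print('train error:', error_rate(model, train_x, train_y))
print('test error:', error_rate(model, test_x, test_y))
```

The training error is zero by construction, yet the error on unseen inputs stays near one half, i.e., no better than random guessing.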


In general, absent any restriction on our model class, we cannot conclude, based on fitting 
the training data alone, that our model has discovered any generalizable pattern (Vapnik et 
al., 1994). On the other hand, if our model class was not capable of fitting arbitrary labels, 
then it must have discovered a pattern. Learning-theoretic ideas about model complexity 
derived some inspiration from the ideas of Karl Popper, an influential philosopher of sci- 
ence, who formalized the criterion of falsifiability. According to Popper, a theory that can 
explain any and all observations is not a scientific theory at all! After all, what has it told us 
about the world if it has not ruled out any possibility? In short, what we want is a hypothesis 
that could not explain just any observation we might conceivably make and yet nevertheless 
happens to be compatible with those observations that we in fact make. 


Now what precisely constitutes an appropriate notion of model complexity is a complex 
matter. Often, models with more parameters are able to fit a greater number of arbitrarily 
assigned labels. However, this is not necessarily true. For instance, kernel methods operate 
in spaces with infinite numbers of parameters, yet their complexity is controlled by other 
means (Schölkopf and Smola, 2002). One notion of complexity that often proves useful 
is the range of values that the parameters can take. Here, a model whose parameters are 
permitted to take arbitrary values would be more complex. We will revisit this idea in the 
next section, when we introduce weight decay, your first practical regularization technique. 
Notably, it can be difficult to compare complexity among members of substantially different 
model classes (say, decision trees vs. neural networks). 


At this point, we must stress another important point that we will revisit when introducing 
deep neural networks. When a model is capable of fitting arbitrary labels, low training 
error does not necessarily imply low generalization error. However, it does not necessarily 
imply high generalization error either! All we can say with confidence is that low training 
error alone is not enough to certify low generalization error. Deep neural networks turn 
out to be just such models: while they generalize well in practice, they are too powerful 
to allow us to conclude much on the basis of training error alone. In these cases we must 
rely more heavily on our holdout data to certify generalization after the fact. Error on the 
holdout data, i.e., validation set, is called the validation error. 


3.6.2 Underfitting or Overfitting? 


When we compare the training and validation errors, we want to be mindful of two 
common situations. First, we want to watch out for cases when our training error and validation 
error are both substantial but there is little gap between them. If the model is unable to 
reduce the training error, that could mean that our model is too simple (i.e., insufficiently 
expressive) to capture the pattern that we are trying to model. Moreover, since the 
generalization gap ($R_\textrm{emp} - R$) between our training and generalization errors is small, we have 
reason to believe that we could get away with a more complex model. This phenomenon is 
known as underfitting. 


On the other hand, as we discussed above, we want to watch out for the cases when our 
training error is significantly lower than our validation error, indicating severe overfitting. 
Note that overfitting is not always a bad thing. In deep learning especially, the best pre- 
dictive models often perform far better on training data than on holdout data. Ultimately, 
we usually care about driving the generalization error lower, and only care about the gap 
insofar as it becomes an obstacle to that end. Note that if the training error is zero, then the 
generalization gap is precisely equal to the generalization error and we can make progress 
only by reducing the gap. 


Polynomial Curve Fitting 


To illustrate some classical intuition about overfitting and model complexity, consider the 
following: given training data consisting of a single feature x and a corresponding real- 
valued label y, we try to find the polynomial of degree d 


$$\hat{y} = \sum_{i=0}^d x^i w_i \tag{3.6.3}$$


for estimating the label y. This is just a linear regression problem where our features are 
given by the powers of x, the model's weights are given by $w_i$, and the bias is given by $w_0$ 
since $x^0 = 1$ for all x. Since this is just a linear regression problem, we can use the squared 
error as our loss function. 


A higher-order polynomial function is more complex than a lower-order polynomial 
function, since the higher-order polynomial has more parameters and the range of functions 
that the model can select from is wider. Fixing the training dataset, higher-order polynomial functions should 
always achieve lower (at worst, equal) training error relative to lower-degree polynomials. 
In fact, whenever each data example has a distinct value of x, a polynomial function with 
degree equal to the number of data examples can fit the training set perfectly. We compare 
the relationship between polynomial degree (model complexity) and both underfitting and 
overfitting in Fig. 3.6.1. 


Fig. 3.6.1: Influence of model complexity on underfitting and overfitting. [Figure: loss 
plotted against model complexity, showing the training loss and generalization loss curves 
and the underfitting, optimum, and overfitting regions.] 
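

The claim that a polynomial of sufficiently high degree can fit any training set with distinct inputs can be checked with Lagrange interpolation. The sketch below is an illustration rather than the least-squares procedure used for regression: the helper lagrange_interpolate and the five hand-picked points are assumptions. It constructs the polynomial of degree at most n - 1 that passes through all n training points.

```python
def lagrange_interpolate(xs, ys):
    """Return the polynomial of degree at most len(xs) - 1 through all points.

    Assumes that all values in xs are distinct.
    """
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f


# Five training points with arbitrary, hand-picked labels.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0.3, -1.2, 0.8, 0.1, -0.7]
f = lagrange_interpolate(xs, ys)
train_error = sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(train_error)
```

A vanishing training error here says nothing about how the polynomial behaves between or beyond the training points, which is exactly the overfitting regime of the figure.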


Dataset Size 


Another big consideration to bear in mind is dataset 
size. Fixing our model, the fewer samples we have in the training dataset, the more likely 
(and more severely) we are to encounter overfitting. As we increase the amount of training 
data, the generalization error typically decreases. Moreover, in general, more data never 
hurts. For a fixed task and data distribution, model complexity should not increase more 
rapidly than the amount of data. Given more data, we might attempt to fit a more complex 
model. Absent sufficient data, simpler models may be more difficult to beat. For many 
tasks, deep learning only outperforms linear models when many thousands of training ex- 
amples are available. In part, the current success of deep learning owes considerably to the 
abundance of massive datasets arising from Internet companies, cheap storage, connected 
devices, and the broad digitization of the economy. 


3.6.3 Model Selection 


Typically, we select our final model only after evaluating multiple models that differ in vari- 
ous ways (different architectures, training objectives, selected features, data preprocessing, 
learning rates, etc.). Choosing among many models is aptly called model selection. 


In principle, we should not touch our test set until after we have chosen all our hyperpa- 
rameters. Were we to use the test data in the model selection process, there is a risk that we 
might overfit the test data. Then we would be in serious trouble. If we overfit our training 
data, there is always the evaluation on test data to keep us honest. But if we overfit the test 
data, how would we ever know? See Ong et al. (2005) for an example of how this can lead 
to absurd results even for models where the complexity can be tightly controlled. 


Thus, we should never rely on the test data for model selection. And yet we cannot rely 
solely on the training data for model selection either because we cannot estimate the gen- 
eralization error on the very data that we use to train the model. 


In practical applications, the picture gets muddier. While ideally we would only touch the 
test data once, to assess the very best model or to compare a small number of models with 
each other, real-world test data is seldom discarded after just one use. We can seldom 
afford a new test set for each round of experiments. In fact, recycling benchmark data for 
decades can have a significant impact on the development of algorithms, e.g., for image 
classification and optical character recognition. 


The common practice for addressing the problem of training on the test set is to split our 
data three ways, incorporating a validation set in addition to the training and test datasets. 
The result is a murky business where the boundaries between validation and test data are 
worryingly ambiguous. Unless explicitly stated otherwise, in the experiments in this book 
we are really working with what should rightly be called training data and validation data, 
with no true test sets. Therefore, the accuracy reported in each experiment of the book is 
really the validation accuracy and not a true test set accuracy. 
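

As a minimal sketch of such a three-way split, the following pure-Python helper shuffles example indices and partitions them. The function name three_way_split, the default fractions of 10% each for validation and test, and the fixed seed are illustrative assumptions, not recommendations from this book.

```python
import random


def three_way_split(n, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle indices 0..n-1 and split them into train/val/test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The three index sets are disjoint, so hyperparameters can be tuned on the validation part while the test part stays untouched until the end.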


Cross-Validation 


When training data is scarce, we might not even be able to afford to hold out enough data to 
constitute a proper validation set. One popular solution to this problem is to employ K-fold 
cross-validation. Here, the original training data is split into K non-overlapping subsets. 
Then model training and validation are executed K times, each time training on $K - 1$ 
subsets and validating on a different subset (the one not used for training in that round). 
Finally, the training and validation errors are estimated by averaging over the results from 
the K experiments. 
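

The procedure can be sketched in pure Python as follows. The helper names k_fold_indices and cross_validate are hypothetical, and fit_and_score stands for any user-supplied function that trains on the given training indices and returns a validation score; the sketch also assumes that the examples have already been shuffled, since the folds are contiguous blocks.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n))
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def cross_validate(n, k, fit_and_score):
    """Average the score returned by fit_and_score over the k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n, k)]
    return sum(scores) / k


for train, val in k_fold_indices(10, 3):
    print(len(train), val)
```

Every example is used for validation exactly once and for training K - 1 times, which is also why the procedure costs roughly K times as much as a single training run.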


3.6.4 Summary 


This section explored some of the underpinnings of generalization in machine learning. 
Some of these ideas become complicated and counterintuitive when we get to deeper mod- 
els; here, models are capable of overfitting data badly, and the relevant notions of complex- 
ity can be both implicit and counterintuitive (e.g., larger architectures with more parameters 
generalizing better). We leave you with a few rules of thumb: 


1. Use validation sets (or K-fold cross-validation) for model selection; 
2. More complex models often require more data; 


3. Relevant notions of complexity include both the number of parameters and the range of 
values that they are allowed to take; 


4. Keeping all else equal, more data almost always leads to better generalization; 


5. This entire talk of generalization is all predicated on the IID assumption. If we relax 
this assumption, allowing for distributions to shift between the train and testing peri- 
ods, then we cannot say anything about generalization absent a further (perhaps milder) 
assumption. 


3.6.5 Exercises 


1. When can you solve the problem of polynomial regression exactly? 


2. Give at least five examples where dependent random variables make treating the problem 
as IID data inadvisable. 


3. Can you ever expect to see zero training error? Under which circumstances would you 
see zero generalization error? 


4. Why is K-fold cross-validation very expensive to compute? 
5. Why is the K-fold cross-validation error estimate biased? 


6. The VC dimension is defined as the maximum number of points that can be classified 
with arbitrary labels $\{\pm 1\}$ by a function of a class of functions. Why might this not be 
a good idea for measuring how complex the class of functions is? Hint: consider the 
magnitude of the functions. 


7. Your manager gives you a difficult dataset on which your current algorithm does not 
perform so well. How would you justify to him that you need more data? Hint: you 
cannot increase the data but you can decrease it. 




3.7 Weight Decay 


Now that we have characterized the problem of overfitting, we can introduce our first 
regularization technique. Recall that we can always mitigate overfitting by collecting more 
training data. However, that can be costly, time consuming, or entirely out of our control, 
making it impossible in the short run. For now, we can assume that we already have as 
much high-quality data as our resources permit and focus on the tools at our disposal when the 
dataset is taken as a given. 


Recall that in our polynomial regression example (Section 3.6.2) we could limit our model’s 
capacity by tweaking the degree of the fitted polynomial. Indeed, limiting the number of 
features is a popular technique for mitigating overfitting. However, simply tossing aside 
features can be too blunt an instrument. Sticking with the polynomial regression example, 
consider what might happen with high-dimensional input. The natural extensions of poly- 
nomials to multivariate data are called monomials, which are simply products of powers 
of variables. The degree of a monomial is the sum of the powers. For example, $x_1^2 x_2$ and 
$x_3 x_5^2$ are both monomials of degree 3. 


Note that the number of terms with degree d blows up rapidly as d grows larger. Given k 
variables, the number of monomials of degree d is $\binom{k - 1 + d}{k - 1}$. Even small changes in degree, 
say from 2 to 3, dramatically increase the complexity of our model. Thus we often need a 
more fine-grained tool for adjusting function complexity. 
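

To get a feel for how quickly this count grows, the following sketch evaluates the binomial coefficient and cross-checks it by brute-force enumeration. The function names and the choice of k = 20 variables are assumptions made for illustration.

```python
from itertools import combinations_with_replacement
from math import comb


def num_monomials(k, d):
    """Closed-form count of monomials of degree exactly d in k variables."""
    return comb(k - 1 + d, k - 1)


def enumerate_monomials(k, d):
    """List monomials as tuples of zero-based variable indices.

    For example, (0, 0, 1) stands for the first variable squared times
    the second variable.
    """
    return list(combinations_with_replacement(range(k), d))


for d in (1, 2, 3, 4):
    print(d, num_monomials(20, d))
```

With 20 variables, moving from degree 2 to degree 3 raises the count from 210 to 1540 terms.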


%matplotlib inline
import torch
from torch import nn
from d2l import torch as d2l


3.7.1 Norms and Weight Decay 


Rather than directly manipulating the number of parameters, weight decay operates by 
restricting the values that the parameters can take. More commonly called $\ell_2$ regularization 
outside of deep learning circles when optimized by minibatch stochastic gradient descent, 
weight decay might be the most widely used technique for regularizing parametric machine 
learning models. The technique is motivated by the basic intuition that among all functions 
f, the function f = 0 (assigning the value 0 to all inputs) is in some sense the simplest, and 
that we can measure the complexity of a function by the distance of its parameters from 
zero. But how precisely should we measure the distance between a function and zero? 
There is no single right answer. In fact, entire branches of mathematics, including parts 
of functional analysis and the theory of Banach spaces, are devoted to addressing such 
issues. 


One simple interpretation might be to measure the complexity of a linear function 
$f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$ by some norm of its weight vector, e.g., $\|\mathbf{w}\|^2$. Recall that we introduced the $\ell_2$ norm 
and $\ell_1$ norm, which are special cases of the more general $\ell_p$ norm, in Section 2.3.11. The 
most common method for ensuring a small weight vector is to add its norm as a penalty 
term to the problem of minimizing the loss. Thus we replace our original objective, 
minimizing the prediction loss on the training labels, with a new objective, minimizing the sum 
of the prediction loss and the penalty term. Now, if our weight vector grows too large, our 
learning algorithm might focus on minimizing the weight norm $\|\mathbf{w}\|^2$ rather than 
minimizing the training error. That is exactly what we want. To illustrate things in code, we revive 
our previous example from Section 3.1 for linear regression. There, our loss was given 
by 

$$L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2. \tag{3.7.1}$$

Recall that $\mathbf{x}^{(i)}$ are the features, $y^{(i)}$ is the label for any data example i, and $(\mathbf{w}, b)$ are 
the weight and bias parameters, respectively. To penalize the size of the weight vector, 
we must somehow add $\|\mathbf{w}\|^2$ to the loss function, but how should the model trade off the 
standard loss for this new additive penalty? In practice, we characterize this trade-off via 
the regularization constant $\lambda$, a nonnegative hyperparameter that we fit using validation 
data: 

$$L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2. \tag{3.7.2}$$


For $\lambda = 0$, we recover our original loss function. For $\lambda > 0$, we restrict the size of $\|\mathbf{w}\|$. 
We divide by 2 by convention: when we take the derivative of a quadratic function, the 
2 and 1/2 cancel out, ensuring that the expression for the update looks nice and simple. 
The astute reader might wonder why we work with the squared norm and not the standard 
norm (i.e., the Euclidean distance). We do this for computational convenience. By squaring 
the $\ell_2$ norm, we remove the square root, leaving the sum of squares of each component of 
the weight vector. This makes the derivative of the penalty easy to compute: the sum of 
derivatives equals the derivative of the sum. 


Moreover, you might ask why we work with the $\ell_2$ norm in the first place and not, say, 
the $\ell_1$ norm. In fact, other choices are valid and popular throughout statistics. While 
$\ell_2$-regularized linear models constitute the classic ridge regression algorithm, $\ell_1$-regularized 
linear regression is a similarly fundamental method in statistics, popularly known as lasso 
regression. One reason to work with the $\ell_2$ norm is that it places an outsize penalty on large 
components of the weight vector. This biases our learning algorithm towards models that 
distribute weight evenly across a larger number of features. In practice, this might make 
them more robust to measurement error in a single variable. By contrast, $\ell_1$ penalties lead 
to models that concentrate weights on a small set of features by clearing the other weights 
to zero. This gives us an effective method for feature selection, which may be desirable for 
other reasons. For example, if our model only relies on a few features, then we may not 
need to collect, store, or transmit data for the other (dropped) features. 
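

A small numerical comparison may help to see why the squared $\ell_2$ penalty favors spreading weight out. The two weight vectors below are hypothetical and chosen so that they have the same $\ell_1$ norm; the helper names are likewise illustrative.

```python
def l1_penalty(w):
    """Sum of absolute values of the weights."""
    return sum(abs(wi) for wi in w)


def squared_l2_penalty(w):
    """Half the sum of squared weights, matching the penalty in (3.7.2)."""
    return sum(wi ** 2 for wi in w) / 2


spread = [1.0, 1.0, 1.0, 1.0]        # weight shared across four features
concentrated = [4.0, 0.0, 0.0, 0.0]  # same total weight on one feature

for name, w in [('spread', spread), ('concentrated', concentrated)]:
    print(name, l1_penalty(w), squared_l2_penalty(w))
```

Both vectors incur an $\ell_1$ penalty of 4, but the squared $\ell_2$ penalty is 2 for the spread-out vector and 8 for the concentrated one, so only the latter penalty distinguishes between them.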


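Using the same notation as in (3.1.11), the minibatch stochastic gradient descent updates for 
$\ell_2$-regularized regression are as follows: 

$$\mathbf{w} \leftarrow \left(1 - \eta\lambda\right) \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \tag{3.7.3}$$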

As before, we update $\mathbf{w}$ based on the amount by which our estimate differs from the 
observation. However, we also shrink the size of $\mathbf{w}$ towards zero. That is why the method is 
sometimes called “weight decay”: given the penalty term alone, our optimization algorithm 
decays the weight at each step of training. In contrast to feature selection, weight decay 
offers us a mechanism for continuously adjusting the complexity of a function. Smaller 
values of $\lambda$ correspond to less constrained $\mathbf{w}$, whereas larger values of $\lambda$ constrain $\mathbf{w}$ more 
considerably. Whether we include a corresponding bias penalty $b^2$ can vary across 
implementations, and may vary across layers of a neural network. Often, we do not regularize 
the bias term. Besides, although $\ell_2$ regularization may not be equivalent to weight decay 
for other optimization algorithms, the idea of regularization through shrinking the size of 
weights still holds true. 
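

For plain SGD, the equivalence between adding the penalty to the loss and shrinking the weights can be verified directly. The sketch below uses a made-up weight vector and a made-up minibatch gradient of the unregularized loss; the function names and numbers are assumptions for illustration only.

```python
def sgd_step_with_penalty(w, grad_loss, lr, lambd):
    """Gradient step on loss + (lambd / 2) * ||w||^2.

    The penalty contributes lambd * w_j to the j-th gradient component.
    """
    return [wj - lr * (gj + lambd * wj) for wj, gj in zip(w, grad_loss)]


def sgd_step_with_decay(w, grad_loss, lr, lambd):
    """Shrink the weights by (1 - lr * lambd), then take the usual step."""
    return [(1 - lr * lambd) * wj - lr * gj for wj, gj in zip(w, grad_loss)]


w = [0.5, -1.0, 2.0]
grad_loss = [0.1, 0.0, -0.3]  # hypothetical minibatch gradient of the loss
a = sgd_step_with_penalty(w, grad_loss, lr=0.1, lambd=3.0)
b = sgd_step_with_decay(w, grad_loss, lr=0.1, lambd=3.0)
print(a)
print(b)
```

Both functions produce the same update up to floating-point rounding, which is the content of (3.7.3). For optimizers that rescale gradients, the two formulations generally no longer coincide.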


3.7.2 High-Dimensional Linear Regression 


We can illustrate the benefits of weight decay through a simple synthetic example. 


First, we generate some data as before: 


$$y = 0.05 + \sum_{i = 1}^d 0.01 x_i + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, 0.01^2). \tag{3.7.4}$$

In this synthetic dataset, our label is given by an underlying linear function of our inputs, 
corrupted by Gaussian noise with zero mean and standard deviation 0.01. For illustrative 
purposes, we can make the effects of overfitting pronounced, by increasing the 
dimensionality of our problem to $d = 200$ and working with a small training set with only 20 
examples. 


class Data(d2l.DataModule):
    def __init__(self, num_train, num_val, num_inputs, batch_size):
        self.save_hyperparameters()
        n = num_train + num_val
        self.X = torch.randn(n, num_inputs)
        noise = torch.randn(n, 1) * 0.01
        w, b = torch.ones((num_inputs, 1)) * 0.01, 0.05
        self.y = torch.matmul(self.X, w) + b + noise

    def get_dataloader(self, train):
        i = slice(0, self.num_train) if train else slice(self.num_train, None)
        return self.get_tensorloader([self.X, self.y], train, i)


3.7.3 Implementation from Scratch 


Now, let’s try implementing weight decay from scratch. Since minibatch stochastic gradient 
descent is our optimizer, we just need to add the squared $\ell_2$ penalty to the original loss 
function. 


Defining $\ell_2$ Norm Penalty 


Perhaps the most convenient way of implementing this penalty is to square all terms in 
place and sum them. 


def l2_penalty(w):
    return (w ** 2).sum() / 2


Defining the Model 


In the final model, the linear regression and the squared loss have not changed since Section
3.4, so we will just define a subclass of d2l.LinearRegressionScratch. The only change
here is that our loss now includes the penalty term.


class WeightDecayScratch(d2l.LinearRegressionScratch):
    def __init__(self, num_inputs, lambd, lr, sigma=0.01):
        super().__init__(num_inputs, lr, sigma)
        self.save_hyperparameters()

    def loss(self, y_hat, y):
        # Squared loss plus the weighted l2 penalty on the weights
        return (super().loss(y_hat, y) +
                self.lambd * l2_penalty(self.w))


The following code fits our model on the training set with 20 examples and evaluates it on 
the validation set with 100 examples. 

data = Data(num_train=20, num_val=100, num_inputs=200, batch_size=5)
trainer = d2l.Trainer(max_epochs=10)


def train_scratch(lambd):
    model = WeightDecayScratch(num_inputs=200, lambd=lambd, lr=0.01)
    model.board.yscale='log'
    trainer.fit(model, data)
    print('L2 norm of w:', float(l2_penalty(model.w)))


Training without Regularization 


We now run this code with lambd = 0, disabling weight decay. Note that we overfit
badly, decreasing the training error but not the validation error, a textbook case of
overfitting.


train_scratch(0)


L2 norm of w: 0.009948714636266232


[Figure: train_loss and val_loss curves; the y-axis is logarithmic.]


Using Weight Decay 


Below, we run with substantial weight decay. Note that the training error increases but 
the validation error decreases. This is precisely the effect we expect from regulariza- 
tion. 


train_scratch(3) 


L2 norm of w: 0.0017270983662456274


3.7.4 Concise Implementation 


Because weight decay is ubiquitous in neural network optimization, the deep learning
framework makes it especially convenient, integrating weight decay into the optimization
algorithm itself for easy use in combination with any loss function. Moreover, this
integration serves a computational benefit, allowing implementation tricks to add weight decay
to the algorithm, without any additional computational overhead. Since the weight decay
portion of the update depends only on the current value of each parameter, the optimizer
must touch each parameter once anyway.
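
To see why folding weight decay into the update is cheap, the following sketch compares two ways of taking a single SGD step. It is written in plain Python so that it runs without any framework; the function names `sgd_step_with_penalty` and `sgd_step_with_decay` and all numeric values are made up for illustration and are not part of d2l or PyTorch. The first function adds the gradient of the penalty $\frac{\lambda}{2}\|\mathbf{w}\|^2$ to the loss gradient, the second shrinks each weight by the factor $(1 - \eta\lambda)$ and then applies the ordinary gradient step.

```python
def sgd_step_with_penalty(w, grad, lr, lambd):
    """One SGD step on loss + (lambd / 2) * ||w||^2.

    The penalty contributes lambd * w_i to the gradient of each weight.
    """
    return [wi - lr * (gi + lambd * wi) for wi, gi in zip(w, grad)]


def sgd_step_with_decay(w, grad, lr, lambd):
    """Same step, written as: shrink each weight, then apply the plain gradient."""
    return [(1 - lr * lambd) * wi - lr * gi for wi, gi in zip(w, grad)]


w = [0.5, -1.0, 2.0]       # current weights (made-up values)
grad = [0.1, 0.3, -0.2]    # gradient of the unpenalized loss (made-up values)
a = sgd_step_with_penalty(w, grad, lr=0.01, lambd=3.0)
b = sgd_step_with_decay(w, grad, lr=0.01, lambd=3.0)
print(a)
print(b)
```

Both lists agree up to floating-point rounding. The second form needs only the current value of each weight, which the optimizer reads anyway when it applies the update.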


Below, we specify the weight decay hyperparameter directly through weight_decay when
instantiating our optimizer. By default, PyTorch decays both weights and biases
simultaneously, but we can configure the optimizer to handle different parameters according to
different policies. Here, we only set weight_decay for the weights (the net.weight
parameters), hence the bias (the net.bias parameter) will not decay.


class WeightDecay(d2l.LinearRegression):
    def __init__(self, wd, lr):
        super().__init__(lr)
        self.save_hyperparameters()
        self.wd = wd

    def configure_optimizers(self):
        # Only the weights get weight_decay; the bias group uses the default
        return torch.optim.SGD([
            {'params': self.net.weight, 'weight_decay': self.wd},
            {'params': self.net.bias}], lr=self.lr)


The plot looks similar to that when we implemented weight decay from scratch. How- 
ever, this version runs faster and is easier to implement, benefits that will become more 
pronounced as you address larger problems and this work becomes more routine. 


model = WeightDecay(wd=3, lr=0.01)
model.board.yscale='log'
trainer.fit(model, data)

print('L2 norm of w:', float(l2_penalty(model.get_w_b()[0])))


L2 norm of w: 0.013779522851109505


So far, we have touched upon one notion of what constitutes a simple linear function.
However, even for simple nonlinear functions, the situation can be much more complex. For
instance, the concept of a reproducing kernel Hilbert space (RKHS) allows one to apply tools
introduced for linear functions in a nonlinear context. Unfortunately, RKHS-based
algorithms tend to scale poorly to large, high-dimensional data. In this book we will often
adopt the common heuristic whereby weight decay is applied to all layers of a deep
network.


3.7.5 Summary 


Regularization is a common method for dealing with overfitting. Classical regularization 
techniques add a penalty term to the loss function (when training) to reduce the complexity 
of the learned model. One particular choice for keeping the model simple is using an $\ell_2$
penalty. This leads to weight decay in the update steps of the minibatch stochastic gradient 
descent algorithm. In practice, the weight decay functionality is provided in optimizers 
from deep learning frameworks. Different sets of parameters can have different update 
behaviors within the same training loop. 


3.7.6 Exercises 


1. Experiment with the value of $\lambda$ in the estimation problem in this section. Plot training
and validation accuracy as a function of $\lambda$. What do you observe?


2. Use a validation set to find the optimal value of $\lambda$. Is it really the optimal value? Does
this matter?


3. What would the update equations look like if instead of $\|\mathbf{w}\|^2$ we used $\sum_i |w_i|$ as our
penalty of choice ($\ell_1$ regularization)?


4. We know that $\|\mathbf{w}\|^2 = \mathbf{w}^\top \mathbf{w}$. Can you find a similar equation for matrices (see the
Frobenius norm in Section 2.3.11)?


5. Review the relationship between training error and generalization error. In addition to 
weight decay, increased training, and the use of a model of suitable complexity, what 
other ways might help us deal with overfitting? 


6. In Bayesian statistics we use the product of prior and likelihood to arrive at a posterior 
via P(w | x) « P(x | w)P(w). How can you identify P(w) with regularization? 


Discussions.


4 Linear Neural Networks for Classification


Now that you have worked through all of the mechanics you are ready to apply the skills 
you have learned to broader kinds of tasks. Even as we pivot towards classification, most 
of the plumbing remains the same: loading the data, passing it through the model, generat- 
ing output, calculating the loss, taking gradients with respect to weights, and updating the 
model. However, the precise form of the targets, the parametrization of the output layer, 
and the choice of loss function will adapt to suit the classification setting. 


4.1 Softmax Regression 


In Section 3.1, we introduced linear regression, working through implementations from 
scratch in Section 3.4 and again using high-level APIs of a deep learning framework in 
Section 3.5 to do the heavy lifting. 


Regression is the hammer we reach for when we want to answer how much? or how many? 
questions. If you want to predict the number of dollars (price) at which a house will be sold, 
or the number of wins a baseball team might have, or the number of days that a patient will 
remain hospitalized before being discharged, then you are probably looking for a regression 
model. However, even within regression models, there are important distinctions. For 
instance, the price of a house will never be negative and changes might often be relative 
to its baseline price. As such, it might be more effective to regress on the logarithm of the 
price. Likewise, the number of days a patient spends in hospital is a discrete nonnegative 
random variable. As such, least mean squares might not be an ideal approach either. This 
sort of time-to-event modeling comes with a host of other complications that are dealt with 
in a specialized subfield called survival modeling. 


The point here is not to overwhelm you but just to let you know that there is a lot more 
to estimation than simply minimizing squared errors. And more broadly, there is a lot 
more to supervised learning than regression. In this section, we focus on classification 
problems where we put aside how much? questions and instead focus on which category? 
questions. 


• Does this email belong in the spam folder or the inbox?

• Is this customer more likely to sign up or not to sign up for a subscription service?

• Does this image depict a donkey, a dog, a cat, or a rooster?

• Which movie is Aston most likely to watch next?

• Which section of the book are you going to read next?


Colloquially, machine learning practitioners overload the word classification to describe 
two subtly different problems: (i) those where we are interested only in hard assignments 
of examples to categories (classes); and (ii) those where we wish to make soft assignments, 
i.e., to assess the probability that each category applies. The distinction tends to get blurred, 
in part, because often, even when we only care about hard assignments, we still use models 
that make soft assignments. 


Even more, there are cases where more than one label might be true. For instance, a news
article might simultaneously cover the topics of entertainment, business, and space flight,
but not the topics of medicine or sports. Thus, categorizing it into one of the above
categories on their own would not be very useful. This problem is commonly known as
multi-label classification. See Tsoumakas and Katakis (2007) for an overview and Huang et
al. (2015) for an effective algorithm when tagging images.


4.1.1 Classification 


To get our feet wet, let's start with a simple image classification problem. Here, each input
consists of a $2 \times 2$ grayscale image. We can represent each pixel value with a single scalar,
giving us four features $x_1, x_2, x_3, x_4$. Further, let's assume that each image belongs to one
among the categories "cat", "chicken", and "dog".


Next, we have to choose how to represent the labels. We have two obvious choices.
Perhaps the most natural impulse would be to choose $y \in \{1, 2, 3\}$, where the integers represent
{dog, cat, chicken} respectively. This is a great way of storing such information on a
computer. If the categories had some natural ordering among them, say if we were trying to
predict {baby, toddler, adolescent, young adult, adult, geriatric}, then it might even make
sense to cast this as an ordinal regression problem and keep the labels in this format.
See Moon et al. (2010) for an overview of different types of ranking loss functions and
Beutel et al. (2014) for a Bayesian approach that addresses responses with more than one
mode.


In general, classification problems do not come with natural orderings among the classes. 
Fortunately, statisticians long ago invented a simple way to represent categorical data: the 
one-hot encoding. A one-hot encoding is a vector with as many components as we have 
categories. The component corresponding to a particular instance's category is set to 1 and
all other components are set to 0. In our case, a label $\mathbf{y}$ would be a three-dimensional vector,
with $(1, 0, 0)$ corresponding to "cat", $(0, 1, 0)$ to "chicken", and $(0, 0, 1)$ to "dog":


$$\mathbf{y} \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}. \qquad (4.1.1)$$
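
As a small illustration of (4.1.1), the sketch below builds one-hot vectors in plain Python. The helper name `one_hot` is made up for this example; deep learning frameworks ship their own utilities for this, which are not shown here.

```python
def one_hot(index, num_classes):
    """Return a list with a 1 at position `index` and 0 everywhere else."""
    if not 0 <= index < num_classes:
        raise ValueError("index out of range")
    return [1 if j == index else 0 for j in range(num_classes)]


classes = ["cat", "chicken", "dog"]  # ordering taken from (4.1.1)
for name in classes:
    print(name, one_hot(classes.index(name), len(classes)))
```

Each vector has exactly one nonzero entry, so no ordering among the categories is implied.
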


Linear Model


In order to estimate the conditional probabilities associated with all the possible classes, 
we need a model with multiple outputs, one per class. To address classification with lin- 
ear models, we will need as many affine functions as we have outputs. Strictly speaking, 
we only need one fewer, since the final category has to be the difference between 1 and
the sum of the other categories, but for reasons of symmetry we use a slightly redundant 
parametrization. Each output corresponds to its own affine function. In our case, since 
we have 4 features and 3 possible output categories, we need 12 scalars to represent the 
weights (w with subscripts), and 3 scalars to represent the biases (b with subscripts). This 
yields: 


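$$
\begin{aligned}
o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1,\\
o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2,\\
o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3.
\end{aligned} \qquad (4.1.2)
$$
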
The corresponding neural network diagram is shown in Fig. 4.1.1. Just as in linear
regression, we use a single-layer neural network. And since the calculation of each output, $o_1$, $o_2$,
and $o_3$, depends on every input, $x_1$, $x_2$, $x_3$, and $x_4$, the output layer can also be described as
a fully connected layer.


Fig. 4.1.1: Softmax regression is a single-layer neural network.


For a more concise notation we use vectors and matrices: $\mathbf{o} = \mathbf{W}\mathbf{x} + \mathbf{b}$ is much better suited
for mathematics and code. Note that we have gathered all of our weights into a $3 \times 4$ matrix
and all biases $\mathbf{b} \in \mathbb{R}^3$ in a vector.
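
The following sketch evaluates $\mathbf{o} = \mathbf{W}\mathbf{x} + \mathbf{b}$ for the $3 \times 4$ case in plain Python. The helper name `affine` and every number in it are made up for illustration; a framework would perform the same computation with a single matrix-vector product.

```python
def affine(W, x, b):
    """Compute o = W x + b, with W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


# Made-up values: 3 classes (rows) and 4 pixel features (columns).
W = [[0.1, 0.2, 0.3, 0.4],
     [0.0, -0.1, 0.0, 0.1],
     [0.5, 0.5, -0.5, -0.5]]
b = [0.0, 1.0, -1.0]
x = [1.0, 2.0, 3.0, 4.0]

o = affine(W, x, b)
print(o)
```

The result has one entry per class, and nothing prevents an entry from being negative or larger than 1.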


The Softmax 


Assuming a suitable loss function, we could try, directly, to minimize the difference be- 
tween o and the labels y. While it turns out that treating classification as a vector-valued 
regression problem works surprisingly well, it is nonetheless unsatisfactory in the following 
ways: 


• There is no guarantee that the outputs $o_i$ sum up to 1 in the way we expect probabilities
to behave.


• There is no guarantee that the outputs $o_i$ are even nonnegative, even if their outputs sum
up to 1, or that they do not exceed 1.


Both aspects render the estimation problem difficult to solve and the solution very brittle
to outliers. For instance, if we assume that there is a positive linear dependency between
the number of bedrooms and the likelihood that someone will buy a house, the probability
might exceed 1 when it comes to buying a mansion! As such, we need a mechanism to
"squish" the outputs.


There are many ways we might accomplish this goal. For instance, we could assume that
the outputs $\mathbf{o}$ are corrupted versions of $\mathbf{y}$, where the corruption occurs by means of adding
noise $\boldsymbol{\epsilon}$ drawn from a normal distribution. In other words, $\mathbf{y} = \mathbf{o} + \boldsymbol{\epsilon}$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.
This is the so-called probit model, first introduced by Fechner (1860). While appealing,
it does not work quite as well nor lead to a particularly nice optimization problem, when
compared to the softmax.


Another way to accomplish this goal (and to ensure nonnegativity) is to use an exponential
function $P(y = i) \propto \exp o_i$. This does indeed satisfy the requirement that the conditional
class probability increases with increasing $o_i$, it is monotonic, and all probabilities are
nonnegative. We can then transform these values so that they add up to 1 by dividing each
by their sum. This process is called normalization. Putting these two pieces together gives
us the softmax function:


$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \quad \text{where} \quad \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}. \qquad (4.1.3)$$

Note that the largest coordinate of $\mathbf{o}$ corresponds to the most likely class according to $\hat{\mathbf{y}}$.
Moreover, because the softmax operation preserves the ordering among its arguments, we
do not need to compute the softmax to determine which class has been assigned the highest
probability. Thus,

$$\operatorname*{argmax}_j \hat{y}_j = \operatorname*{argmax}_j o_j. \qquad (4.1.4)$$
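
A direct transcription of (4.1.3) and (4.1.4) into plain Python might look as follows. The helper names `softmax` and `argmax` and the logits are made up for illustration, and this naive version ignores the numerical issues that arise for very large inputs.

```python
import math


def softmax(o):
    """Exponentiate each logit and normalize so that the outputs sum to 1."""
    exps = [math.exp(v) for v in o]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda j: values[j])


o = [3.0, 1.2, -3.0]  # made-up logits; note that one of them is negative
y_hat = softmax(o)
print(y_hat, sum(y_hat))
print(argmax(o), argmax(y_hat))
```

All entries of `y_hat` are positive, they sum to 1 up to rounding, and the index of the largest entry is the same before and after the softmax.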

The idea of a softmax dates back to Gibbs (1902), who adapted ideas from physics. Dating 
even further back, Boltzmann, the father of modern statistical physics, used this trick to 
model a distribution over energy states in gas molecules. In particular, he discovered that 
the prevalence of a state of energy in a thermodynamic ensemble, such as the molecules in a 
gas, is proportional to $\exp(-E/kT)$. Here, $E$ is the energy of a state, $T$ is the temperature,
and k is the Boltzmann constant. When statisticians talk about increasing or decreasing 
the “temperature” of a statistical system, they refer to changing T in order to favor lower 
or higher energy states. Following Gibbs’ idea, energy equates to error. Energy-based 
models (Ranzato et al., 2007) use this point of view when describing problems in deep 
learning. 


Vectorization 


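To improve computational efficiency, we vectorize calculations in minibatches of data.
Assume that we are given a minibatch $\mathbf{X} \in \mathbb{R}^{n \times d}$ of $n$ examples with dimensionality (number
of inputs) $d$. Moreover, assume that we have $q$ categories in the output. Then the weights
satisfy $\mathbf{W} \in \mathbb{R}^{d \times q}$ and the bias satisfies $\mathbf{b} \in \mathbb{R}^{1 \times q}$.

$$
\begin{aligned}
\mathbf{O} &= \mathbf{X}\mathbf{W} + \mathbf{b},\\
\hat{\mathbf{Y}} &= \mathrm{softmax}(\mathbf{O}).
\end{aligned} \qquad (4.1.5)
$$
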
This accelerates the dominant operation into a matrix-matrix product $\mathbf{X}\mathbf{W}$. Moreover,
since each row in X represents a data example, the softmax operation itself can be computed 
rowwise: for each row of O, exponentiate all entries and then normalize them by the sum. 
Note, though, that care must be taken to avoid exponentiating and taking logarithms of large 
numbers, since this can cause numerical overflow or underflow. Deep learning frameworks 
take care of this automatically. 
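
One common way to avoid the overflow is to subtract the largest entry of each row before exponentiating, which leaves the softmax unchanged. The sketch below shows this in plain Python; the function name `stable_softmax_rows` and the inputs are made up, and no claim is made that any particular framework implements it exactly this way.

```python
import math


def stable_softmax_rows(O):
    """Row-wise softmax that subtracts each row's maximum before exponentiating.

    Subtracting a constant from every logit in a row leaves the softmax
    unchanged, but it caps the largest exponent at exp(0) = 1.
    """
    result = []
    for row in O:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


O = [[1.0, 2.0, 3.0],
     [1000.0, 1001.0, 1002.0]]  # math.exp(1000.0) on its own would overflow
Y_hat = stable_softmax_rows(O)
print(Y_hat)
```

The two rows differ only by a constant shift, so they produce the same probabilities, even though the second row could not be exponentiated directly.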


4.1.2 Loss Function 


Now that we have a mapping from features $\mathbf{x}$ to probabilities $\hat{\mathbf{y}}$, we need a way to optimize
the accuracy of this mapping. We will rely on maximum likelihood estimation, the very
same method that we encountered when providing a probabilistic justification for the mean
squared error loss in Section 3.1.3.


Log-Likelihood 


The softmax function gives us a vector $\hat{\mathbf{y}}$, which we can interpret as the (estimated)
conditional probabilities of each class, given any input $\mathbf{x}$, such as $\hat{y}_1 = P(y = \text{cat} \mid \mathbf{x})$. In the
following we assume that for a dataset with features $\mathbf{X}$ the labels $\mathbf{Y}$ are represented using
a one-hot encoding label vector. We can compare the estimates with reality by checking
how probable the actual classes are according to our model, given the features:


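$$P(\mathbf{Y} \mid \mathbf{X}) = \prod_{i=1}^n P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}). \qquad (4.1.6)$$

We are allowed to use the factorization since we assume that each label is drawn
independently from its respective distribution $P(\mathbf{y} \mid \mathbf{x}^{(i)})$. Since maximizing the product of terms
is awkward, we take the negative logarithm to obtain the equivalent problem of minimizing
the negative log-likelihood:

$$-\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^n -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) = \sum_{i=1}^n l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}), \qquad (4.1.7)$$

where for any pair of label $\mathbf{y}$ and model prediction $\hat{\mathbf{y}}$ over $q$ classes, the loss function $l$
is

$$l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^q y_j \log \hat{y}_j. \qquad (4.1.8)$$
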
For reasons explained later on, the loss function in (4.1.8) is commonly called the
cross-entropy loss. Since $\mathbf{y}$ is a one-hot vector of length $q$, the sum over all its coordinates $j$
vanishes for all but one term. Note that the loss $l(\mathbf{y}, \hat{\mathbf{y}})$ is bounded from below by 0 whenever $\hat{\mathbf{y}}$
is a probability vector: no single entry is larger than 1, hence their negative logarithm
cannot be lower than 0; $l(\mathbf{y}, \hat{\mathbf{y}}) = 0$ only if we predict the actual label with certainty. This can
never happen for any finite setting of the weights because taking a softmax output towards
1 requires taking the corresponding input $o_i$ to infinity (or all other outputs $o_j$ for $j \neq i$
to negative infinity). Even if our model could assign an output probability of 0, any error
made when assigning such high confidence would incur infinite loss ($-\log 0 = \infty$).
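
The following plain-Python sketch evaluates the loss in (4.1.8) for a one-hot label. The function name `cross_entropy` and the probability vectors are made up for illustration.

```python
import math


def cross_entropy(y, y_hat):
    """l(y, y_hat) = -sum_j y_j * log(y_hat_j); terms with y_j = 0 are skipped."""
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat) if y_j > 0)


y = [0, 1, 0]                    # one-hot label (made-up example)
confident = [0.01, 0.98, 0.01]   # most of the mass on the true class
unsure = [0.4, 0.3, 0.3]         # most of the mass on the wrong classes

print(cross_entropy(y, confident))
print(cross_entropy(y, unsure))
```

Because only one term of the sum survives, the loss reduces to the negative log of the probability assigned to the true class. It is small but strictly positive for the confident prediction and larger for the unsure one.
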


Softmax and Cross-Entropy Loss


Since the softmax function and the corresponding cross-entropy loss are so common, it is 
worth understanding a bit better how they are computed. Plugging (4.1.3) into the defini- 
tion of the loss in (4.1.8) and using the definition of the softmax we obtain 


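$$
\begin{aligned}
l(\mathbf{y}, \hat{\mathbf{y}}) &= -\sum_{j=1}^q y_j \log \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} \\
&= \sum_{j=1}^q y_j \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j \\
&= \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j.
\end{aligned} \qquad (4.1.9)
$$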

To understand a bit better what is going on, consider the derivative with respect to any logit
$o_j$. We get

$$\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j. \qquad (4.1.10)$$

In other words, the derivative is the difference between the probability assigned by our
model, as expressed by the softmax operation, and what actually happened, as expressed
by elements in the one-hot label vector. In this sense, it is very similar to what we saw in
regression, where the gradient was the difference between the observation $y$ and estimate
$\hat{y}$. This is not a coincidence. In any exponential family model, the gradients of the
log-likelihood are given by precisely this term. This fact makes computing gradients easy in
practice.
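
The identity in (4.1.10) can be checked numerically. The sketch below, in plain Python with made-up logits and helper names, compares $\mathrm{softmax}(\mathbf{o})_j - y_j$ against a central finite-difference estimate of the derivative of the loss written as in (4.1.9).

```python
import math


def softmax(o):
    m = max(o)  # subtracting the maximum does not change the result
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]


def loss(o, y):
    """Cross-entropy of softmax(o) against y, as in the last line of (4.1.9).

    This form assumes that the entries of y sum to 1.
    """
    m = max(o)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in o))
    return log_sum_exp - sum(y_j * o_j for y_j, o_j in zip(y, o))


o = [0.5, -1.0, 2.0]   # made-up logits
y = [0, 0, 1]          # one-hot label
eps = 1e-6

analytic = [p - y_j for p, y_j in zip(softmax(o), y)]
numeric = []
for j in range(len(o)):
    o_plus = o[:j] + [o[j] + eps] + o[j + 1:]
    o_minus = o[:j] + [o[j] - eps] + o[j + 1:]
    numeric.append((loss(o_plus, y) - loss(o_minus, y)) / (2 * eps))

print(analytic)
print(numeric)
```

The two lists agree to within the accuracy of the finite-difference approximation. The entries of `analytic` also sum to zero, since both the softmax output and the label sum to 1.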


Now consider the case where we observe not just a single outcome but an entire distribution
over outcomes. We can use the same representation as before for the label $\mathbf{y}$. The only
difference is that rather than a vector containing only binary entries, say $(0, 0, 1)$, we now have
a generic probability vector, say $(0.1, 0.2, 0.7)$. The math that we used previously to define
the loss $l$ in (4.1.8) still works well, just that the interpretation is slightly more general. It
is the expected value of the loss for a distribution over labels. This loss is called the
cross-entropy loss and it is one of the most commonly used losses for classification problems. We
can demystify the name by introducing just the basics of information theory. In a nutshell,
it measures the number of bits needed to encode what we see, $\mathbf{y}$, relative to what we predict
that should happen, $\hat{\mathbf{y}}$. We provide a very basic explanation in the following. For further
details on information theory see Cover and Thomas (1999) or MacKay (2003).


4.1.3 Information Theory Basics 


Many deep learning papers use intuition and terms from information theory. To make sense 
of them, we need some common language. This is a survival guide. Information theory 
deals with the problem of encoding, decoding, transmitting, and manipulating information 
(also known as data). 


Entropy


The central idea in information theory is to quantify the amount of information contained 
in data. This places a limit on our ability to compress data. For a distribution P its entropy, 
H[P], is defined as: 


$$H[P] = \sum_j -P(j) \log P(j). \qquad (4.1.11)$$


One of the fundamental theorems of information theory states that in order to encode data
drawn randomly from the distribution $P$, we need at least $H[P]$ "nats" to encode it
(Shannon, 1948). If you wonder what a "nat" is, it is the equivalent of bit but when using a code
with base $e$ rather than one with base 2. Thus, one nat is $\frac{1}{\log(2)} \approx 1.44$ bit.
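
As a quick check of (4.1.11) and of the conversion between nats and bits, consider the plain-Python sketch below. The function names and the two example distributions are made up for illustration.

```python
import math


def entropy_nats(p):
    """H[P] = sum_j -P(j) log P(j), using the natural logarithm (nats)."""
    return sum(-p_j * math.log(p_j) for p_j in p if p_j > 0)


def nats_to_bits(h):
    """Convert nats to bits by dividing by log(2)."""
    return h / math.log(2)


fair_coin = [0.5, 0.5]
constant = [1.0, 0.0]  # an outcome that is always the same

print(entropy_nats(fair_coin), nats_to_bits(entropy_nats(fair_coin)))
print(entropy_nats(constant))
```

A fair coin has an entropy of $\log 2$ nats, which is exactly one bit, while a distribution that puts all its mass on one outcome has zero entropy.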


Surprisal 


You might be wondering what compression has to do with prediction. Imagine that we have 
a stream of data that we want to compress. If it is always easy for us to predict the next 
token, then this data is easy to compress. Take the extreme example where every token in 
the stream always takes the same value. That is a very boring data stream! And not only
is it boring, but it is also easy to predict. Because the tokens are always the same, we do
not have to transmit any information to communicate the contents of the stream. Easy to 
predict, easy to compress. 


However, if we cannot perfectly predict every event, then we might sometimes be surprised.
Our surprise is greater when an event is assigned lower probability. Claude Shannon settled
on $\log \frac{1}{P(j)} = -\log P(j)$ to quantify one's surprisal at observing an event $j$ having assigned
it a (subjective) probability $P(j)$. The entropy defined in (4.1.11) is then the expected
surprisal when one assigned the correct probabilities that truly match the data-generating
process.


Cross-Entropy Revisited 


So if entropy is the level of surprise experienced by someone who knows the true
probability, then you might be wondering, what is cross-entropy? The cross-entropy from $P$ to
$Q$, denoted $H(P, Q)$, is the expected surprisal of an observer with subjective probabilities
$Q$ upon seeing data that was actually generated according to probabilities $P$. This is given
by $H(P, Q) = \sum_j -P(j) \log Q(j)$. The lowest possible cross-entropy is achieved when
$P = Q$. In this case, the cross-entropy from $P$ to $Q$ is $H(P, P) = H(P)$.
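
The claim that the cross-entropy is smallest when $P = Q$ can be illustrated on a small example. The sketch below is plain Python; the function name `cross_entropy` and both distributions are made up for illustration.

```python
import math


def cross_entropy(p, q):
    """H(P, Q) = sum_j -P(j) log Q(j), in nats."""
    return sum(-p_j * math.log(q_j) for p_j, q_j in zip(p, q) if p_j > 0)


p = [0.1, 0.2, 0.7]        # data-generating distribution (made-up)
q = [1 / 3, 1 / 3, 1 / 3]  # observer who believes all outcomes are equally likely

h_pp = cross_entropy(p, p)  # equals the entropy H(P)
h_pq = cross_entropy(p, q)
print(h_pp, h_pq)
```

The observer with the uniform belief pays $\log 3$ nats per observation on average, which is more than the entropy of $P$ itself.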


In short, we can think of the cross-entropy classification objective in two ways: (i) as
maximizing the likelihood of the observed data; and (ii) as minimizing our surprisal (and thus
the number of bits) required to communicate the labels.


4.1.4 Summary and Discussion 


In this section, we encountered the first nontrivial loss function, allowing us to optimize over 
discrete output spaces. Key in its design was that we took a probabilistic approach, treating 
discrete categories as instances of draws from a probability distribution. As a side effect,
we encountered the softmax, a convenient activation function that transforms outputs of an 
ordinary neural network layer into valid discrete probability distributions. We saw that the 
derivative of the cross-entropy loss when combined with softmax behaves very similarly 
to the derivative of squared error; namely by taking the difference between the expected 
behavior and its prediction. And, while we were only able to scratch the very surface of it, 
we encountered exciting connections to statistical physics and information theory. 


While this is enough to get you on your way, and hopefully enough to whet your appetite,
we hardly dived deep here. Among other things, we skipped over computational
considerations. Specifically, for any fully connected layer with $d$ inputs and $q$ outputs, the
parametrization and computational cost is $\mathcal{O}(dq)$, which can be prohibitively high in
practice. Fortunately, this cost of transforming $d$ inputs into $q$ outputs can be reduced through
approximation and compression. For instance, Deep Fried Convnets (Yang et al., 2015)
uses a combination of permutations, Fourier transforms, and scaling to reduce the cost
from quadratic to log-linear. Similar techniques work for more advanced structural matrix
approximations (Sindhwani et al., 2015). Lastly, we can use quaternion-like
decompositions to reduce the cost to $\mathcal{O}(\frac{dq}{n})$, again if we are willing to trade off a small amount of
accuracy for computational and storage cost (Zhang et al., 2021) based on a compression
factor $n$. This is an active area of research. What makes it challenging is that we do not
necessarily strive for the most compact representation or the smallest number of floating
point operations but rather for the solution that can be executed most efficiently on modern
GPUs.


4.1.5 Exercises 


1. We can explore the connection between exponential families and softmax in some more
depth.

   1. Compute the second derivative of the cross-entropy loss $l(\mathbf{y}, \hat{\mathbf{y}})$ for softmax.

   2. Compute the variance of the distribution given by $\mathrm{softmax}(\mathbf{o})$ and show that it matches
   the second derivative computed above.


2. Assume that we have three classes which occur with equal probability, i.e., the
probability vector is $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$.

   1. What is the problem if we try to design a binary code for it?

   2. Can you design a better code? Hint: what happens if we try to encode two
   independent observations? What if we encode $n$ observations jointly?

3. When encoding signals transmitted over a physical wire, engineers do not always use
binary codes. For instance, PAM-3 uses three signal levels $\{-1, 0, 1\}$ as opposed to
two levels $\{0, 1\}$. How many ternary units do you need to transmit an integer in the
range $\{0, \ldots, 7\}$? Why might this be a better idea in terms of electronics?


4. The Bradley-Terry model uses a logistic model to capture preferences. For a user to
choose between apples and oranges one assumes scores $o_{\text{apple}}$ and $o_{\text{orange}}$. Our
requirements are that larger scores should lead to a higher likelihood in choosing the associated
item and that the item with the largest score is the most likely one to be chosen (Bradley
and Terry, 1952).

   1. Prove that softmax satisfies this requirement.

   2. What happens if you want to allow for a default option of choosing neither apples
   nor oranges? Hint: now the user has three choices.


5. Softmax gets its name from the following mapping: $\mathrm{RealSoftMax}(a, b) = \log(\exp(a) + \exp(b))$.

   1. Prove that $\mathrm{RealSoftMax}(a, b) > \max(a, b)$.

   2. How small can you make the difference between both functions? Hint: without loss
   of generality you can set $b = 0$ and $a > b$.

   3. Prove that this holds for $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b)$, provided that $\lambda > 0$.

   4. Show that for $\lambda \to \infty$ we have $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b) \to \max(a, b)$.

   5. Construct an analogous softmin function.

   6. Extend this to more than two numbers.


6. The function $g(\mathbf{x}) \stackrel{\mathrm{def}}{=} \log \sum_i \exp x_i$ is sometimes also referred to as the log-partition
function.

   1. Prove that the function is convex. Hint: to do so, use the fact that the first derivative
   amounts to the probabilities from the softmax function and show that the second
   derivative is the variance.

   2. Show that $g$ is translation invariant, i.e., $g(\mathbf{x} + b) = g(\mathbf{x})$.

   3. What happens if some of the coordinates $x_i$ are very large? What happens if they're
   all very small?

   4. Show that if we choose $b = \max_i x_i$ we end up with a numerically stable
   implementation.


7. Assume that we have some probability distribution $P$. Suppose we pick another
distribution $Q$ with $Q(i) \propto P(i)^\alpha$ for $\alpha > 0$.

   1. Which choice of $\alpha$ corresponds to doubling the temperature? Which choice
   corresponds to halving it?

   2. What happens if we let the temperature approach 0?

   3. What happens if we let the temperature approach $\infty$?


4.2 The Image Classification Dataset


One widely used dataset for image classification is the MNIST dataset (LeCun et al.,
1998) of handwritten digits. At the time of its release in the 1990s it posed a formidable
challenge to most machine learning algorithms, consisting of 60,000 images of $28 \times 28$
pixels resolution (plus a test dataset of 10,000 images). To put things into perspective, back
in 1995, a Sun SPARCStation 5 with a whopping 64 MB of RAM and a blistering 5 MFLOPs
was considered state-of-the-art equipment for machine learning at AT&T Bell Laboratories.
Achieving high accuracy on digit recognition was a key component in automating letter
sorting for the USPS in the 1990s. Deep networks such as LeNet-5 (LeCun et al., 1995),
support vector machines with invariances (Schölkopf et al., 1996), and tangent distance
classifiers (Simard et al., 1998) all could reach error rates below 1%.


For over a decade, MNIST served as the point of reference for comparing machine learn- 
ing algorithms. While it had a good run as a benchmark dataset, even simple models by 
today’s standards achieve classification accuracy over 95%, making it unsuitable for distin- 
guishing between strong models and weaker ones. Even more, the dataset allows for very 
high levels of accuracy, not typically seen in many classification problems. This skewed 
algorithmic development towards specific families of algorithms that can take advantage 
of clean datasets, such as active set methods and boundary-seeking active set algorithms. 
Today, MNIST serves as more of a sanity check than as a benchmark. ImageNet (Deng et 
al., 2009) poses a much more relevant challenge. Unfortunately, ImageNet is too large for 
many of the examples and illustrations in this book, as it would take too long to train to 
make the examples interactive. As a substitute we will focus our discussion in the coming 
sections on the qualitatively similar, but much smaller Fashion-MNIST dataset (Xiao et al., 
2017) which was released in 2017. It contains images of 10 categories of clothing at $28 \times 28$
pixels resolution. 


%matplotlib inline
import time
import torch
import torchvision
from torchvision import transforms
from d2l import torch as d2l

d2l.use_svg_display()


4.2.1 Loading the Dataset 


Since the Fashion-MNIST dataset is so useful, all major frameworks provide preprocessed 
versions of it. We can download and read it into memory using built-in framework utili- 
ties. 

class FashionMNIST(d2l.DataModule):  #@save
    """The Fashion-MNIST dataset."""
    def __init__(self, batch_size=64, resize=(28, 28)):
        super().__init__()
        self.save_hyperparameters()
        # Resize each image, then convert it to a float tensor
        trans = transforms.Compose([transforms.Resize(resize),
                                    transforms.ToTensor()])
        self.train = torchvision.datasets.FashionMNIST(
            root=self.root, train=True, transform=trans, download=True)
        self.val = torchvision.datasets.FashionMNIST(
            root=self.root, train=False, transform=trans, download=True)


Fashion-MNIST consists of images from 10 categories, each represented by 6000 images
in the training dataset and by 1000 in the test dataset. Consequently, the training set and
the test set contain 60,000 and 10,000 images, respectively. A test dataset is used for
evaluating model performance (it must not be used for training).


data = FashionMNIST(resize=(32, 32)) 
len(data.train), len(data.val) 


(60000, 10000) 


The images are grayscale and upscaled to 32 × 32 pixels in resolution above. This is similar
to the original MNIST dataset, which consisted of (binary) black and white images. Note,
though, that most modern image data has three channels (red, green, blue) and that
hyperspectral images can have in excess of 100 channels (the HyMap sensor has 126 channels).
By convention we store an image as a c × h × w tensor, where c is the number of color
channels, h is the height and w is the width.


data.train[0][0].shape 


torch.Size([1, 32, 32]) 


The categories of Fashion-MNIST have human-understandable names. The following con- 
venience method converts between numeric labels and their names. 


@d2l.add_to_class(FashionMNIST)  #@save
def text_labels(self, indices):
    """Return text labels."""
    labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
              'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [labels[int(i)] for i in indices]


4.2.2 Reading a Minibatch 


To make our life easier when reading from the training and test sets, we use the built-in data
iterator rather than creating one from scratch. Recall that at each iteration, a data iterator
reads a minibatch of data with size batch_size. We also randomly shuffle the examples
for the training data iterator.


@d2l.add_to_class(FashionMNIST)  #@save
def get_dataloader(self, train):
    data = self.train if train else self.val
    return torch.utils.data.DataLoader(data, self.batch_size, shuffle=train,
                                       num_workers=self.num_workers)
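To make the roles of batch_size and shuffle concrete, here is a minimal, framework-free sketch of what a data iterator does. It is not how torch.utils.data.DataLoader is implemented; the function name minibatches, its seed argument, and the toy data are hypothetical and chosen only for illustration.

```python
import random

def minibatches(examples, batch_size, shuffle, seed=0):
    """Yield lists of at most `batch_size` examples (hypothetical helper)."""
    indices = list(range(len(examples)))
    if shuffle:
        # Shuffle the indices rather than the data, so the data stays in place
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]

toy_data = list(range(10))  # stand-in for 10 (image, label) pairs
batches = list(minibatches(toy_data, batch_size=4, shuffle=True))
batch_sizes = [len(b) for b in batches]  # the last batch may be smaller
```

With 10 examples and a batch size of 4, the final minibatch holds only 2 examples, a detail that matters later when per-batch statistics are averaged.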


To see how this works, let’s load a minibatch of images by invoking the train_dataloader 
method. It contains 64 images. 


X, y = next(iter(data.train_dataloader())) 
print(X.shape, X.dtype, y.shape, y.dtype) 


torch.Size([64, 1, 32, 32]) torch.float32 torch.Size([64]) torch.int64


Let’s look at the time it takes to read the images. Even though it is a built-in loader, it is not 
blazingly fast. Nonetheless, this is sufficient since processing images with a deep network 
takes quite a bit longer. Hence it is good enough that training a network will not be I/O 
constrained. 


tic = time.time()
for X, y in data.train_dataloader():
    continue
f'{time.time() - tic:.2f} sec'


'4.69 sec'


4.2.3 Visualization 


We will often be using the Fashion-MNIST dataset. A convenience function show_images
can be used to visualize the images and the associated labels. Skipping implementation
details, we just show the interface below: for such utility functions we only need to know
how to invoke d2l.show_images rather than how it works.


def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):  #@save
    """Plot a list of images."""
    raise NotImplementedError


Let’s put it to good use. In general, it is a good idea to visualize and inspect data that you are 
training on. Humans are very good at spotting oddities and because of that, visualization 
serves as an additional safeguard against mistakes and errors in the design of experiments. 
Here are the images and their corresponding labels (in text) for the first few examples of a
minibatch (drawn here from the validation data loader).


@d2l.add_to_class(FashionMNIST)  #@save
def visualize(self, batch, nrows=1, ncols=8, labels=[]):
    X, y = batch
    if not labels:
        labels = self.text_labels(y)
    d2l.show_images(X.squeeze(1), nrows, ncols, titles=labels)

batch = next(iter(data.val_dataloader()))
data.visualize(batch)


[Figure: eight sample images, titled ankle boot, pullover, trouser, trouser, shirt, trouser, coat, shirt]


We are now ready to work with the Fashion-MNIST dataset in the sections that follow. 


4.2.4 Summary 


We now have a slightly more realistic dataset to use for classification. Fashion-MNIST is an 
apparel classification dataset consisting of images representing 10 categories. We will use 
this dataset in subsequent sections and chapters to evaluate various network designs, from 
a simple linear model to advanced residual networks. As we commonly do with images, 
we read them as a tensor of shape (batch size, number of channels, height, width). For now, 
we only have one channel as the images are grayscale (the visualization above uses a false 
color palette for improved visibility). 


Lastly, data iterators are a key component for efficient performance. For instance, we might 
use GPUs for efficient image decompression, video transcoding, or other preprocessing. 
Whenever possible, you should rely on well-implemented data iterators that exploit high- 
performance computing to avoid slowing down your training loop. 


4.2.5 Exercises 
1. Does reducing the batch_size (for instance, to 1) affect the reading performance?

2. The data iterator performance is important. Do you think the current implementation
   is fast enough? Explore various options to improve it. Use a system profiler to find out
   where the bottlenecks are.

3. Check out the framework's online API documentation. Which other datasets are available?


Discussions.


4.3 The Base Classification Model


You may have noticed that the implementations from scratch and the concise implementa- 
tion using framework functionality were quite similar in the case of regression. The same 
is true for classification. Since many models in this book deal with classification, it is worth 
adding functionalities to support this setting specifically. This section provides a base class 
for classification models to simplify future code. 


import torch
from d2l import torch as d2l


4.3.1 The Classifier Class 


We define the Classifier class below. In the validation_step we report both the loss 
value and the classification accuracy on a validation batch. We draw an update for every 
num_val_batches batches. This has the benefit of generating the averaged loss and ac- 
curacy on the whole validation data. These average numbers are not exactly correct if the 
final batch contains fewer examples, but we ignore this minor difference to keep the code 
simple. 


class Classifier(d2l.Module):  #@save
    """The base class of classification models."""
    def validation_step(self, batch):
        Y_hat = self(*batch[:-1])
        self.plot('loss', self.loss(Y_hat, batch[-1]), train=False)
        self.plot('acc', self.accuracy(Y_hat, batch[-1]), train=False)
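The caveat about a smaller final batch can be checked with a few lines of arithmetic. The sketch below uses made-up per-example losses (not output from any model) and compares the average of the per-batch means with the exact mean over all examples.

```python
# Made-up per-example losses for 5 validation examples (illustrative only)
losses = [1.0, 1.0, 1.0, 1.0, 3.0]
batch_size = 2

# Split into minibatches; the last one holds a single example
batches = [losses[i:i + batch_size] for i in range(0, len(losses), batch_size)]
batch_means = [sum(b) / len(b) for b in batches]

# What averaging the per-batch values amounts to
mean_of_batch_means = sum(batch_means) / len(batch_means)
# Exact average over all examples
exact_mean = sum(losses) / len(losses)
```

The final batch contains one example but receives the same weight as a full batch, which is why the two averages differ slightly here (about 1.67 versus 1.4). When all batches have equal size, the two quantities coincide.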


By default we use a stochastic gradient descent optimizer, operating on minibatches, just 
as we did in the context of linear regression. 


@d2l.add_to_class(d2l.Module)  #@save
def configure_optimizers(self):
    return torch.optim.SGD(self.parameters(), lr=self.lr)


4.3.2 Accuracy 


Given the predicted probability distribution y_hat, we typically choose the class with the
highest predicted probability whenever we must output a hard prediction. Indeed, many
applications require that we make a choice. For instance, Gmail must categorize an email
into “Primary”, “Social”, “Updates”, “Forums”, or “Spam”. It might estimate probabilities
internally, but at the end of the day it has to choose one among the classes.


When predictions are consistent with the label class y, they are correct. The classification
accuracy is the fraction of all predictions that are correct. Although it can be difficult to
optimize accuracy directly (it is not differentiable), it is often the performance measure that
we care about the most. It is often the relevant quantity in benchmarks. As such, we will
nearly always report it when training classifiers.


Accuracy is computed as follows. First, if y_hat is a matrix, we assume that the second
dimension stores prediction scores for each class. We use argmax to obtain the predicted class
by the index for the largest entry in each row. Then we compare the predicted class with
the ground truth y elementwise. Since the equality operator == is sensitive to data types,
we convert y_hat's data type to match that of y. The result is a tensor containing entries
of 0 (false) and 1 (true). Taking the sum yields the number of correct predictions, and
taking the mean yields the accuracy.


@d2l.add_to_class(Classifier)  #@save
def accuracy(self, Y_hat, Y, averaged=True):
    """Compute the number of correct predictions."""
    Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))
    preds = Y_hat.argmax(axis=1).type(Y.dtype)
    compare = (preds == Y.reshape(-1)).type(torch.float32)
    return compare.mean() if averaged else compare
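The same steps (argmax over each row, elementwise comparison, then averaging) can be traced by hand on a tiny example. This is a plain-Python sketch with made-up scores; accuracy_from_scores is a hypothetical helper and not part of d2l.

```python
def accuracy_from_scores(scores, labels):
    """Fraction of rows whose largest score sits at the labeled index."""
    # argmax of each row: the index holding the largest score
    preds = [max(range(len(row)), key=row.__getitem__) for row in scores]
    # 1 where the prediction matches the label, 0 otherwise
    correct = [int(p == y) for p, y in zip(preds, labels)]
    return sum(correct) / len(correct)

# Two examples, three classes (made-up numbers)
scores = [[0.1, 0.3, 0.6],
          [0.3, 0.2, 0.5]]
labels = [0, 2]
acc = accuracy_from_scores(scores, labels)
```

Both rows have their largest score at index 2, so only the second example (label 2) is classified correctly and the accuracy is one half.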


4.3.3 Summary 


Classification is a sufficiently common problem that it warrants its own convenience func- 
tions. Of central importance in classification is the accuracy of the classifier. Note that 
while we often care primarily about accuracy, we train classifiers to optimize a variety of 
other objectives for statistical and computational reasons. However, regardless of which 
loss function was minimized during training, it is useful to have a convenience method for 
assessing the accuracy of our classifier empirically. 


4.3.4 Exercises 


1. Denote by $L_v$ the validation loss, and let $L_v^q$ be its quick and dirty estimate computed
   by the loss function averaging in this section. Lastly, denote by $l_v^b$ the loss on the last
   minibatch. Express $L_v$ in terms of $L_v^q$, $l_v^b$, and the sample and minibatch sizes.

2. Show that the quick and dirty estimate $L_v^q$ is unbiased. That is, show that
   $E[L_v] = E[L_v^q]$. Why would you still want to use $L_v$ instead?

3. Given a multiclass classification loss, denoting by $l(y, y')$ the penalty of estimating
   $y'$ when we see $y$ and given a probability $p(y \mid x)$, formulate the rule for an optimal
   selection of $y'$. Hint: express the expected loss, using $l$ and $p(y \mid x)$.


Discussions.


4.4 Softmax Regression Implementation from Scratch


Because softmax regression is so fundamental, we believe that you ought to know how to 
implement it yourself. Here, we limit ourselves to defining the softmax-specific aspects of 
the model and reuse the other components from our linear regression section, including the 
training loop. 


import torch
from d2l import torch as d2l


4.4.1 The Softmax 


Let’s begin with the most important part: the mapping from scalars to probabilities. For a 
refresher, recall the operation of the sum operator along specific dimensions in a tensor, as 
discussed in Section 2.3.6 and Section 2.3.7. Given a matrix X we can sum over all elements 
(by default) or only over elements in the same axis. The axis variable lets us compute row 
and column sums: 


X = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
X.sum(0, keepdims=True), X.sum(1, keepdims=True)


(tensor([[5., 7., 9.]]),
 tensor([[ 6.],
         [15.]]))


Computing the softmax requires three steps: (i) exponentiation of each term; (ii) a sum
over each row to compute the normalization constant for each example; (iii) division of
each row by its normalization constant, ensuring that the result sums to 1:

$$\mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}. \qquad (4.4.1)$$

The (logarithm of the) denominator is called the (log) partition function. It was introduced
in statistical physics to sum over all possible states in a thermodynamic ensemble. The
implementation is straightforward:


def softmax(X):
    X_exp = torch.exp(X)
    partition = X_exp.sum(1, keepdims=True)
    return X_exp / partition  # The broadcasting mechanism is applied here


For any input X, we turn each element into a nonnegative number. Each row sums up to
1, as is required for a probability. Caution: the code above is not robust against very large
or very small arguments. While it is sufficient to illustrate what is happening, you should
not use this code verbatim for any serious purpose. Deep learning frameworks have such
protections built in and we will be using the built-in softmax going forward.


X = torch.rand((2, 5)) 
X_prob = softmax(X) 
X_prob, X_prob.sum(1) 


(tensor([[0.2511, 0.1417, 0.1158, 0.2529, 0.2385],
         [0.2004, 0.1419, 0.1957, 0.2504, 0.2117]]),
 tensor([1., 1.]))


4.4.2 The Model 


We now have everything that we need to implement the softmax regression model. As in 
our linear regression example, each instance will be represented by a fixed-length vector. 
Since the raw data here consists of 28 x 28 pixel images, we flatten each image, treating 
them as vectors of length 784. In later chapters, we will introduce convolutional neural 
networks, which exploit the spatial structure in a more satisfying way. 


In softmax regression, the number of outputs from our network should be equal to the 
number of classes. Since our dataset has 10 classes, our network has an output dimension 
of 10. Consequently, our weights constitute a 784 x 10 matrix plus a 1 x 10 row vector for 
the biases. As with linear regression, we initialize the weights W with Gaussian noise. The 
biases are initialized as zeros. 


class SoftmaxRegressionScratch(d2l.Classifier):
    def __init__(self, num_inputs, num_outputs, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.W = torch.normal(0, sigma, size=(num_inputs, num_outputs),
                              requires_grad=True)
        self.b = torch.zeros(num_outputs, requires_grad=True)

    def parameters(self):
        return [self.W, self.b]


The code below defines how the network maps each input to an output. Note that we flatten 
each 28 x 28 pixel image in the batch into a vector using reshape before passing the data 
through our model. 


@d2l.add_to_class(SoftmaxRegressionScratch)
def forward(self, X):
    X = X.reshape((-1, self.W.shape[0]))
    return softmax(torch.matmul(X, self.W) + self.b)


4.4.3 The Cross-Entropy Loss 


Next we need to implement the cross-entropy loss function (introduced in Section 4.1.2).
This may be the most common loss function in all of deep learning. At the moment,
applications of deep learning easily cast as classification problems far outnumber those better
treated as regression problems.


Recall that cross-entropy takes the negative log-likelihood of the predicted probability
assigned to the true label. For efficiency we avoid Python for-loops and use indexing instead.
In particular, the one-hot encoding in $\mathbf{y}$ allows us to select the matching terms in
$\hat{\mathbf{y}}$.


To see this in action we create sample data y_hat with 2 examples of predicted probabilities 
over 3 classes and their corresponding labels y. The correct labels are 0 and 2 respectively 
(i.e., the first and third class). Using y as the indices of the probabilities in y_hat, we can 
pick out terms efficiently. 


y = torch.tensor([0, 2])
y_hat = torch.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y_hat[[0, 1], y]


tensor([0.1000, 0.5000])


Now we can implement the cross-entropy loss function by averaging over the logarithms of 
the selected probabilities. 


def cross_entropy(y_hat, y):
    return -torch.log(y_hat[list(range(len(y_hat))), y]).mean()

cross_entropy(y_hat, y)


tensor(1.4979)
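As a sanity check on the value above, the same average can be worked out by hand with the standard library. The numbers are the ones from the example; the variable names picked and loss are introduced here only for illustration.

```python
import math

y = [0, 2]
y_hat = [[0.1, 0.3, 0.6],
         [0.3, 0.2, 0.5]]

# Probability assigned to the true class of each example: 0.1 and 0.5
picked = [row[label] for row, label in zip(y_hat, y)]
# Average negative log-likelihood: -(log 0.1 + log 0.5) / 2
loss = -sum(math.log(p) for p in picked) / len(picked)
```

Rounded to four decimal places, loss agrees with the tensor(1.4979) reported above.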


@d2l.add_to_class(SoftmaxRegressionScratch)
def loss(self, y_hat, y):
    return cross_entropy(y_hat, y)


4.4.4 Training 


We reuse the fit method defined in Section 3.4 to train the model for 10 epochs. Note that
the number of epochs (max_epochs), the minibatch size (batch_size), and learning rate
(lr) are adjustable hyperparameters. That means that while these values are not learned
during our primary training loop, they still influence the performance of our model, both
vis-à-vis training and generalization performance. In practice you will want to choose these
values based on the validation split of the data and then, ultimately, to evaluate your final
model on the test split. As discussed in Section 3.6.3, we will regard the test data of
Fashion-MNIST as the validation set, thus reporting validation loss and validation accuracy
on this split.


data = d2l.FashionMNIST(batch_size=256)
model = SoftmaxRegressionScratch(num_inputs=784, num_outputs=10, lr=0.1)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)


[Figure: train_loss, val_loss, and val_acc plotted over epochs 0 to 10]


4.4.5 Prediction 


Now that training is complete, our model is ready to classify some images. 


X, y = next(iter(data.val_dataloader()))
preds = model(X).argmax(axis=1)
preds.shape


torch.Size([256])


We are more interested in the images we label incorrectly. We visualize them by comparing 
their actual labels (first line of text output) with the predictions from the model (second line 
of text output). 


wrong = preds.type(y.dtype) != y
X, y, preds = X[wrong], y[wrong], preds[wrong]
labels = [a+'\n'+b for a, b in zip(
    data.text_labels(y), data.text_labels(preds))]
data.visualize([X, y], labels=labels)


[Figure: eight misclassified images, each titled with its actual label above the predicted label]


4.4.6 Summary 


By now we are starting to get some experience with solving linear regression and
classification problems. With it, we have reached what would arguably be the state of the art
of statistical modeling in the 1960s and 1970s. In the next section, we will show you how to
leverage deep learning frameworks to implement this model much more efficiently.


4.4.7 Exercises 


1. In this section, we directly implemented the softmax function based on the mathematical
   definition of the softmax operation. As discussed in Section 4.1 this can cause numerical
   instabilities.

   1. Test whether softmax still works correctly if an input has a value of 100.

   2. Test whether softmax still works correctly if the largest of all inputs is smaller than
      −100.

   3. Implement a fix by looking at the value relative to the largest entry in the argument.

2. Implement a cross_entropy function that follows the definition of the cross-entropy
   loss function $\sum_i y_i \log \hat{y}_i$.

   1. Try it out in the code example of this section.

   2. Why do you think it runs more slowly?

   3. Should you use it? When would it make sense to?

   4. What do you need to be careful of? Hint: consider the domain of the logarithm.

3. Is it always a good idea to return the most likely label? For example, would you do this
   for medical diagnosis? How would you try to address this?

4. Assume that we want to use softmax regression to predict the next word based on some
   features. What are some problems that might arise from a large vocabulary?

5. Experiment with the hyperparameters of the code in this section. In particular:

   1. Plot how the validation loss changes as you change the learning rate.

   2. Do the validation and training loss change as you change the minibatch size? How
      large or small do you need to go before you see an effect?


4.5 Concise Implementation of Softmax Regression


Just as high-level deep learning frameworks made it easier to implement linear regression
(see Section 3.5), they are similarly convenient here.


import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


4.5.1 Defining the Model 


As in Section 3.5, we construct our fully connected layer using the built-in layer. The
built-in __call__ method then invokes forward whenever we need to apply the network to some
input.


We use a Flatten layer to convert the fourth-order tensor X to second order by keeping the 
dimensionality along the first axis unchanged. 


class SoftmaxRegression(d2l.Classifier):  #@save
    """The softmax regression model."""
    def __init__(self, num_outputs, lr):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(nn.Flatten(),
                                 nn.LazyLinear(num_outputs))

    def forward(self, X):
        return self.net(X)


4.5.2 Softmax Revisited 


In Section 4.4 we calculated our model’s output and applied the cross-entropy loss. While 
this is perfectly reasonable mathematically, it is risky computationally, because of numer- 
ical underflow and overflow in the exponentiation. 


Recall that the softmax function computes probabilities via
$\hat{y}_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}$. If some of the $o_k$ are very large, i.e.,
very positive, then $\exp(o_k)$ might be larger than the largest number we can have for certain
data types. This is called overflow. Likewise, if every argument is a very large negative
number, we will get underflow. For instance, single precision floating point numbers
approximately cover the range of $10^{-38}$ to $10^{38}$. As such, if the largest term in
$\mathbf{o}$ lies outside the interval $[-90, 90]$, the result will not be stable. A way round
this problem is to subtract $\bar{o} \stackrel{\mathrm{def}}{=} \max_k o_k$ from all entries:

$$\hat{y}_j = \frac{\exp o_j}{\sum_k \exp o_k} = \frac{\exp(o_j - \bar{o}) \exp \bar{o}}{\sum_k \exp(o_k - \bar{o}) \exp \bar{o}} = \frac{\exp(o_j - \bar{o})}{\sum_k \exp(o_k - \bar{o})}. \qquad (4.5.1)$$

By construction we know that $o_j - \bar{o} \leq 0$ for all $j$. As such, for a $q$-class
classification problem, the denominator is contained in the interval $[1, q]$. Moreover, the
numerator never exceeds 1, thus preventing numerical overflow. Numerical underflow only occurs
when $\exp(o_j - \bar{o})$ numerically evaluates as 0. Nonetheless, a few steps down the road we
might find ourselves in trouble when we want to compute $\log \hat{y}_j$ as $\log 0$. In
particular, in backpropagation, we might find ourselves faced with a screenful of the dreaded
NaN (Not a Number) results.


Fortunately, we are saved by the fact that even though we are computing exponential
functions, we ultimately intend to take their log (when calculating the cross-entropy loss). By
combining softmax and cross-entropy, we can escape the numerical stability issues
altogether. We have:

$$\log \hat{y}_j = \log \frac{\exp(o_j - \bar{o})}{\sum_k \exp(o_k - \bar{o})} = o_j - \bar{o} - \log \sum_k \exp(o_k - \bar{o}). \qquad (4.5.2)$$

This avoids both overflow and underflow. We will want to keep the conventional softmax
function handy in case we ever want to evaluate the output probabilities by our model. But
instead of passing softmax probabilities into our new loss function, we just pass the logits
and compute the softmax and its log all at once inside the cross-entropy loss function, which
does smart things like the “LogSumExp trick”.


@d2l.add_to_class(d2l.Classifier)  #@save
def loss(self, Y_hat, Y, averaged=True):
    Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))
    Y = Y.reshape((-1,))
    return F.cross_entropy(
        Y_hat, Y, reduction='mean' if averaged else 'none')
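Equations (4.5.1) and (4.5.2) can be sketched in plain Python as follows. This only illustrates the idea and makes no claim about how F.cross_entropy is implemented internally; stable_softmax and log_softmax are hypothetical names, and the logits are made up to provoke extreme values.

```python
import math

def stable_softmax(o):
    """Softmax with the largest logit subtracted first, as in (4.5.1)."""
    o_bar = max(o)
    exps = [math.exp(v - o_bar) for v in o]  # every exponent is <= 0
    total = sum(exps)                        # lies in [1, len(o)]
    return [e / total for e in exps]

def log_softmax(o):
    """Log-probabilities computed directly from logits, as in (4.5.2)."""
    o_bar = max(o)
    log_sum = math.log(sum(math.exp(v - o_bar) for v in o))
    return [v - o_bar - log_sum for v in o]

# Made-up logits: math.exp(1000.0) on its own would raise OverflowError
logits = [1000.0, 0.0, -1000.0]
probs = stable_softmax(logits)
log_probs = log_softmax(logits)
```

In this example the second and third probabilities underflow to zero, so taking their logarithm afterwards would fail, whereas log_softmax returns finite values for every class. This is the reason for handing logits, rather than probabilities, to the loss.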


4.5.3 Training 


Next we train our model. We use Fashion-MNIST images, flattened to 784-dimensional 
feature vectors. 


data = d2l.FashionMNIST(batch_size=256)
model = SoftmaxRegression(num_outputs=10, lr=0.1)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)


[Figure: train_loss, val_loss, and val_acc plotted over epochs 0 to 10]


As before, this algorithm converges to a solution that is reasonably accurate, albeit this time 
with fewer lines of code than before. 


4.5.4 Summary 


High-level APIs are very convenient at hiding from their user potentially dangerous aspects,
such as numerical instability. Moreover, they allow users to design models concisely with
very few lines of code. This is both a blessing and a curse. The obvious benefit is that it
makes things highly accessible, even to engineers who never took a single class of statistics
in their life (in fact, they are part of the target audience of the book). But hiding the sharp
edges also comes with a price: a disincentive to add new and different components on your
own, since there is little muscle memory for doing it. Moreover, it makes it more difficult
to fix things whenever the protective padding of a framework fails to cover all the corner
cases entirely. Again, this is due to lack of familiarity.


As such, we strongly urge you to review both the bare bones and the elegant versions of 
many of the implementations that follow. While we emphasize ease of understanding, the 
implementations are nonetheless usually quite performant (convolutions are the big excep- 
tion here). It is our intention to allow you to build on these when you invent something new 
that no framework can give you. 


4.5.5 Exercises 


1. Deep learning uses many different number formats, including FP64 double precision
   (used extremely rarely), FP32 single precision, BFLOAT16 (good for compressed
   representations), FP16 (very unstable), TF32 (a new format from NVIDIA), and INT8.
   Compute the smallest and largest argument of the exponential function for which the
   result does not lead to numerical underflow or overflow.


2. INT8 is a very limited format consisting of nonzero numbers from 1 to 255. How could 
you extend its dynamic range without using more bits? Do standard multiplication and 
addition still work? 


3. Increase the number of epochs for training. Why might the validation accuracy decrease 
after a while? How could we fix this? 


4. What happens as you increase the learning rate? Compare the loss curves for several 
learning rates. Which one works better? When? 


Discussions.


4.6 Generalization in Classification


So far, we have focused on how to tackle multiclass classification problems by training
(linear) neural networks with multiple outputs and softmax functions. Interpreting our
model's outputs as probabilistic predictions, we motivated and derived the cross-entropy
loss function, which calculates the negative log likelihood that our model (for a fixed set
of parameters) assigns to the actual labels. And finally, we put these tools into practice
by fitting our model to the training set. However, as always, our goal is to learn general
patterns, as assessed empirically on previously unseen data (the test set). High accuracy
on the training set means nothing. Whenever each of our inputs is unique (and indeed this
is true for most high-dimensional datasets), we can attain perfect accuracy on the training
set by just memorizing the dataset on the first training epoch, and subsequently looking up
the label whenever we see a new image. And yet, memorizing the exact labels associated
with the exact training examples does not tell us how to classify new examples. Absent
further guidance, we might have to fall back on random guessing whenever we encounter
new examples.
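The memorization scheme can be written down in a few lines. The sketch below uses made-up toy data and a hypothetical MemorizingClassifier class; it is meant only to show that perfect training accuracy says nothing about inputs that were never seen.

```python
import random

class MemorizingClassifier:
    """Looks up training labels; guesses at random for unseen inputs."""
    def __init__(self, num_classes, seed=0):
        self.table = {}
        self.num_classes = num_classes
        self.rng = random.Random(seed)

    def fit(self, inputs, labels):
        # "Training" is a single pass that stores every (input, label) pair
        for x, y in zip(inputs, labels):
            self.table[x] = y

    def predict(self, x):
        if x in self.table:
            return self.table[x]
        # Nothing was learned about unseen inputs, so guess
        return self.rng.randrange(self.num_classes)

# Toy data: each input is a unique tuple of "pixels" (made up)
train_x = [(0, 1), (1, 0), (1, 1), (0, 0)]
train_y = [1, 1, 0, 0]

clf = MemorizingClassifier(num_classes=2)
clf.fit(train_x, train_y)
train_acc = sum(clf.predict(x) == y
                for x, y in zip(train_x, train_y)) / len(train_x)
unseen_pred = clf.predict((2, 2))  # falls back on a random guess
```

The training accuracy is perfect by construction, yet the prediction for the unseen input (2, 2) carries no information.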


A number of burning questions demand immediate attention: 


1. How many test examples do we need to give a good estimate of the accuracy of our 
classifiers on the underlying population? 


2. What happens if we keep evaluating models on the same test set repeatedly?


3. Why should we expect that fitting our linear models to the training set should fare any 
better than our naive memorization scheme? 


Whereas Section 3.6 introduced the basics of overfitting and generalization in the context of
linear regression, this chapter will go a little deeper, introducing some of the foundational
ideas of statistical learning theory. It turns out that we often can guarantee generalization
a priori: for many models, and for any desired upper bound on the generalization gap $\epsilon$,
we can often determine some required number of samples n such that if our training set
contains at least n samples, our empirical error will lie within $\epsilon$ of the true error, for
any data generating distribution. Unfortunately, it also turns out that while these sorts of
guarantees provide a profound set of intellectual building blocks, they are of limited practical
utility to the deep learning practitioner. In short, these guarantees suggest that ensuring
generalization of deep neural networks a priori requires an absurd number of examples
(perhaps trillions or more), even when we find that, on the tasks we care about, deep neural
networks typically generalize remarkably well with far fewer examples (thousands). Thus
deep learning practitioners often forgo a priori guarantees altogether, instead employing
methods that have generalized well on similar problems in the past, and certifying
generalization post hoc through empirical evaluations. When we get to Chapter 5, we will
revisit generalization and provide a light introduction to the vast scientific literature that
has sprung up in attempts to explain why deep neural networks generalize in practice.


4.6.1 The Test Set 


Since we have already begun to rely on test sets as the gold standard method for assessing
generalization error, let's get started by discussing the properties of such error estimates.
Let's focus on a fixed classifier f, without worrying about how it was obtained. Moreover
suppose that we possess a fresh dataset of examples
$\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^n$ that were not used
to train the classifier f. The empirical error of our classifier f on $\mathcal{D}$ is simply the
fraction of instances for which the prediction $f(\mathbf{x}^{(i)})$ disagrees with the true label
$y^{(i)}$, and is given by the following expression:

$$\epsilon_\mathcal{D}(f) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(f(\mathbf{x}^{(i)}) \neq y^{(i)}). \qquad (4.6.1)$$

By contrast, the population error is the expected fraction of examples in the underlying
population (some distribution P(X, Y) characterized by probability density function p(x, y))
for which our classifier disagrees with the true label:

$$\epsilon(f) = E_{(\mathbf{x}, y) \sim P} \mathbf{1}(f(\mathbf{x}) \neq y) = \int\int \mathbf{1}(f(\mathbf{x}) \neq y) \, p(\mathbf{x}, y) \; d\mathbf{x} \, dy. \qquad (4.6.2)$$


While $\epsilon(f)$ is the quantity that we actually care about, we cannot observe it directly,
just as we cannot directly observe the average height in a large population without measuring
every single person. We can only estimate this quantity based on samples. Because our
test set $\mathcal{D}$ is statistically representative of the underlying population, we can view
$\epsilon_\mathcal{D}(f)$ as a statistical estimator of the population error $\epsilon(f)$.
Moreover, because our quantity of interest $\epsilon(f)$ is an expectation (of the random
variable $\mathbf{1}(f(X) \neq Y)$) and the corresponding estimator $\epsilon_\mathcal{D}(f)$ is
the sample average, estimating the population error is simply the classic problem of mean
estimation, which you may recall from Section 2.6.


An important classical result from probability theory called the central limit theorem
guarantees that whenever we possess n random samples $a_1, \ldots, a_n$ drawn from any
distribution with mean $\mu$ and standard deviation $\sigma$, then, as the number of samples n
approaches infinity, the sample average $\hat{\mu}$ approximately tends towards a normal
distribution centered at the true mean and with standard deviation $\sigma/\sqrt{n}$. Already,
this tells us something important: as the number of examples grows large, our test error
$\epsilon_\mathcal{D}(f)$ should approach the true error $\epsilon(f)$ at a rate of
$\mathcal{O}(1/\sqrt{n})$. Thus, to estimate our test error twice as precisely, we must
collect four times as large a test set. To estimate it one hundred times as precisely, we
must collect ten thousand times as large a test set. In general, such a rate of
$\mathcal{O}(1/\sqrt{n})$ is often the best we can hope for in statistics.
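A small simulation makes the square-root rate tangible. The sketch below draws Bernoulli "errors" with an assumed true error rate of 0.1 (an arbitrary choice for illustration) and measures how much the test error fluctuates across repeated test sets of two different sizes; spread_of_test_error is a hypothetical helper.

```python
import math
import random

def spread_of_test_error(true_error, n, trials, rng):
    """Empirical standard deviation of the test error over repeated test sets."""
    estimates = []
    for _ in range(trials):
        # Each test example is misclassified with probability `true_error`
        mistakes = sum(rng.random() < true_error for _ in range(n))
        estimates.append(mistakes / n)
    mean = sum(estimates) / trials
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / trials)

rng = random.Random(0)
true_error = 0.1  # assumed for illustration
spread_small = spread_of_test_error(true_error, n=100, trials=2000, rng=rng)
spread_large = spread_of_test_error(true_error, n=400, trials=2000, rng=rng)
ratio = spread_small / spread_large  # theory suggests a value near 2
```

Quadrupling the test set size from 100 to 400 should roughly halve the spread of the estimate, in line with a standard deviation proportional to 1/√n.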


Now that we know something about the asymptotic rate at which our test error
$\epsilon_\mathcal{D}(f)$ converges to the true error $\epsilon(f)$, we can zoom in on some
important details. Recall that the random variable of interest $\mathbf{1}(f(X) \neq Y)$ can
only take values 0 and 1 and thus is a Bernoulli random variable, characterized by a parameter
indicating the probability that it takes value 1. Here, 1 means that our classifier made an
error, so the parameter of our random variable is actually the true error rate $\epsilon(f)$.
The variance $\sigma^2$ of a Bernoulli depends on its parameter (here, $\epsilon(f)$) according
to the expression $\epsilon(f)(1 - \epsilon(f))$. While $\epsilon(f)$ is initially unknown, we
know that it cannot be greater than 1. A little investigation of this function reveals that
our variance is highest when the true error rate is close to 0.5 and can be far lower when it
is close to 0 or close to 1. This tells us that the asymptotic standard deviation of our
estimate $\epsilon_\mathcal{D}(f)$ of the error $\epsilon(f)$ (over the choice of the n test
samples) cannot be any greater than $\sqrt{0.25/n}$.


If we ignore the fact that this rate characterizes behavior as the test set size approaches
infinity rather than when we possess finite samples, this tells us that if we want our test
error $\epsilon_\mathcal{D}(f)$ to approximate the population error $\epsilon(f)$ such that one standard deviation
corresponds to an interval of $\pm 0.01$, then we should collect roughly 2500 samples. If we
want to fit two standard deviations in that range and thus be 95% confident that
$\epsilon_\mathcal{D}(f) \in \epsilon(f) \pm 0.01$, then we will need 10,000 samples!
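

The arithmetic behind these two figures can be sketched as follows. The function name
`asymptotic_test_set_size` and its parameters are hypothetical helpers introduced only for
illustration. The sketch uses the worst-case Bernoulli standard deviation of 0.5 (variance
0.25) and solves `num_std * 0.5 / sqrt(n) <= half_width` for `n`.

```python
import math


def asymptotic_test_set_size(half_width, num_std=1.0, worst_case_std=0.5):
    """Smallest n such that num_std standard deviations fit within half_width.

    Uses the asymptotic standard deviation worst_case_std / sqrt(n), where
    worst_case_std = 0.5 is the largest possible value for a Bernoulli variable.
    """
    n = (num_std * worst_case_std / half_width) ** 2
    return math.ceil(n - 1e-9)  # small guard against floating-point round-off


one_std = asymptotic_test_set_size(0.01, num_std=1)
two_std = asymptotic_test_set_size(0.01, num_std=2)
print(one_std, two_std)
```

Because the worst-case variance is used, a classifier whose true error rate is far from 0.5
would need fewer samples for the same precision.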


This turns out to be the size of the test sets for many popular benchmarks in machine learn- 
ing. You might be surprised to find out that thousands of applied deep learning papers get 
published every year making a big deal out of error rate improvements of 0.01 or less. Of 
course, when the error rates are much closer to 0, then an improvement of 0.01 can indeed
be a big deal. 


One pesky feature of our analysis thus far is that it really only tells us about asymptotics,
i.e., how the relationship between $\epsilon_\mathcal{D}$ and $\epsilon$ evolves as our sample size goes to infinity.
Fortunately, because our random variable is bounded, we can obtain valid finite sample
bounds by applying an inequality due to Hoeffding (1963):


$$P(\epsilon_\mathcal{D}(f) - \epsilon(f) \geq t) < \exp\left(-2nt^2\right). \tag{4.6.3}$$


Solving for the smallest dataset size that would allow us to conclude with 95% confidence
that the distance $t$ between our estimate $\epsilon_\mathcal{D}(f)$ and the true error rate $\epsilon(f)$ does not exceed
0.01, you will find that roughly 15,000 examples are required as compared to the 10,000
examples suggested by the asymptotic analysis above. If you go deeper into statistics you
will find that this trend holds generally. Guarantees that hold even in finite samples are
typically slightly more conservative. Note that in the scheme of things, these numbers
are not so far apart, reflecting the general usefulness of asymptotic analysis for giving us
ballpark figures even if they are not guarantees we can take to court.
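

Setting the right-hand side of Hoeffding's inequality to a failure probability $\delta$ and solving
for $n$ gives $n \geq \log(1/\delta) / (2t^2)$. The sketch below evaluates this expression; the function
name and the `two_sided` option are hypothetical additions for illustration. The one-sided form
matches the inequality as stated above, while the two-sided variant (a union bound over both
tails) is an assumption that goes slightly beyond the text.

```python
import math


def hoeffding_test_set_size(t, delta, two_sided=False):
    """Smallest n for which the Hoeffding bound exp(-2 n t^2) is at most delta.

    With two_sided=True the bound is applied to both tails (union bound),
    which replaces delta by delta / 2.
    """
    tails = 2.0 if two_sided else 1.0
    return math.ceil(math.log(tails / delta) / (2 * t ** 2))


n_one_sided = hoeffding_test_set_size(t=0.01, delta=0.05)
n_two_sided = hoeffding_test_set_size(t=0.01, delta=0.05, two_sided=True)
print(n_one_sided, n_two_sided)
```

The one-sided result is just under 15,000, in line with the rough figure quoted above.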


4.6.2 Test Set Reuse 


In some sense, you are now set up to succeed at conducting empirical machine learning
research. Nearly all practical models are developed and validated based on test set
performance and you are now a master of the test set. For any fixed classifier $f$, you know how
to evaluate its test error $\epsilon_\mathcal{D}(f)$, and know precisely what can (and cannot) be said about its
population error $\epsilon(f)$.


So let’s say that you take this knowledge and prepare to train your first model $f_1$. Knowing
just how confident you need to be in your classifier’s error rate, you apply
our analysis above to determine an appropriate number of examples to set aside for the test
set. Moreover, let’s assume that you took the lessons from Section 3.6 to heart and made
sure to preserve the sanctity of the test set by conducting all of your preliminary analysis,
hyperparameter tuning, and even selection among multiple competing model architectures
on a validation set. Finally you evaluate your model $f_1$ on the test set and report an unbiased
estimate of the population error with an associated confidence interval.


So far everything seems to be going well. However, that night you wake up at 3am with
a brilliant idea for a new modeling approach. The next day, you code up your new model,
tune its hyperparameters on the validation set and not only are you getting your new model
$f_2$ to work but its error rate appears to be much lower than $f_1$’s. However, the thrill of
discovery suddenly fades as you prepare for the final evaluation. You do not have a test
set!


Even though the original test set $\mathcal{D}$ is still sitting on your server, you now face two formidable
problems. First, when you collected your test set, you determined the required level of
precision under the assumption that you were evaluating a single classifier $f$. However, if
you get into the business of evaluating multiple classifiers $f_1, \ldots, f_k$ on the same test set,
you must consider the problem of false discovery. Before, you might have been 95% sure
that $\epsilon_\mathcal{D}(f) \in \epsilon(f) \pm 0.01$ for a single classifier $f$ and thus the probability of a misleading
result was a mere 5%. With $k$ classifiers in the mix, it can be hard to guarantee that there
is not even one among them whose test set performance is misleading. With 20 classifiers
under consideration, you might have no power at all to rule out the possibility that at least
one among them received a misleading score. This problem relates to multiple hypothesis
testing, which, despite a vast literature in statistics, remains a persistent problem plaguing
scientific research.
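

To see why 20 classifiers can exhaust a 5% error budget, the sketch below computes two
quantities. The first is the union bound $k \cdot 0.05$, which makes no assumption about how the
classifiers relate to one another and reaches 1 at $k = 20$. The second assumes, purely for
illustration, that the misleading events are independent across classifiers, which is not
guaranteed when they share a test set. Both function names are hypothetical.

```python
def union_bound(k, alpha=0.05):
    """Upper bound on P(at least one of k results is misleading), no assumptions."""
    return min(1.0, k * alpha)


def prob_some_misleading(k, alpha=0.05):
    """P(at least one of k results is misleading), ASSUMING independent events."""
    return 1 - (1 - alpha) ** k


for k in (1, 5, 20):
    print(f"k={k:2d}  union bound={union_bound(k):.2f}  "
          f"independent case={prob_some_misleading(k):.3f}")
```

Even under the optimistic independence assumption, the chance that some classifier among 20
receives a misleading score is well above one half.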


If that is not enough to worry you, there is a special reason to distrust the results that you 
get on subsequent evaluations. Recall that our analysis of test set performance rested on 
the assumption that the classifier was chosen absent any contact with the test set and thus 
we could view the test set as drawn randomly from the underlying population. Here, not 
only are you testing multiple functions, the subsequent function $f_2$ was chosen after you
observed the test set performance of $f_1$. Once information from the test set has leaked
to the modeler, it can never be a true test set again in the strictest sense. This problem is 
called adaptive overfitting and has recently emerged as a topic of intense interest to learning 
theorists and statisticians (Dwork et al., 2015). Fortunately, while it is possible to leak all 
information out of a holdout set, and the theoretical worst case scenarios are bleak, these 
analyses may be too conservative. In practice, take care to create real test sets, to consult 
them as infrequently as possible, to account for multiple hypothesis testing when reporting 
confidence intervals, and to dial up your vigilance more aggressively when the stakes are 
high and your dataset size is small. When running a series of benchmark challenges, it is 
often good practice to maintain several test sets so that after each round, the old test set can 
be demoted to a validation set. 


4.6.3 Statistical Learning Theory 


Put simply, test sets are all that we really have, and yet this fact seems strangely unsatisfy- 
ing. First, we seldom possess a true test set—unless we are the ones creating the dataset, 
someone else has probably already evaluated their own classifier on our ostensible “test 
set”. And even when we have first dibs, we soon find ourselves frustrated, wishing we 
could evaluate our subsequent modeling attempts without the gnawing feeling that we can- 
not trust our numbers. Moreover, even a true test set can only tell us post hoc whether a 
classifier has in fact generalized to the population, not whether we have any reason to expect 
a priori that it should generalize. 


With these misgivings in mind, you might now be sufficiently primed to see the appeal of 
statistical learning theory, the mathematical subfield of machine learning whose practi- 
tioners aim to elucidate the fundamental principles that explain why/when models trained 
on empirical data can/will generalize to unseen data. One of the primary aims of statistical 
learning researchers has been to bound the generalization gap, relating the properties of the 
model class to the number of samples in the dataset. 


Learning theorists aim to bound the difference between the empirical error $\epsilon_\mathcal{S}(f_\mathcal{S})$ of a
learned classifier $f_\mathcal{S}$, both trained and evaluated on the training set $\mathcal{S}$, and the true error
$\epsilon(f_\mathcal{S})$ of that same classifier on the underlying population. This might look similar to
the evaluation problem that we just addressed but there is a major difference. Earlier, the
classifier $f$ was fixed and we only needed a dataset for evaluative purposes. And indeed,
any fixed classifier does generalize: its error on a (previously unseen) dataset is an unbiased
estimate of the population error. But what can we say when a classifier is trained and
evaluated on the same dataset? Can we ever be confident that the training error will be
close to the testing error?


Suppose that our learned classifier $f_\mathcal{S}$ must be chosen from some pre-specified set of
functions $\mathcal{F}$. Recall from our discussion of test sets that while it is easy to estimate the error of a
single classifier, things get hairy when we begin to consider collections of classifiers. Even
if the empirical error of any one (fixed) classifier will be close to its true error with high
probability, once we consider a collection of classifiers, we need to worry about the
possibility that just one of them will receive a badly estimated error. The worry is that we might
pick such a classifier and thereby grossly underestimate the population error. Moreover,
even for linear models, because their parameters are continuously valued, we are typically
choosing from an infinite class of functions ($|\mathcal{F}| = \infty$).


One ambitious solution to the problem is to develop analytic tools for proving uniform
convergence, i.e., that with high probability, the empirical error rate for every classifier
in the class $f \in \mathcal{F}$ will simultaneously converge to its true error rate. In other words,
we seek a theoretical principle that would allow us to state that with probability at least
$1 - \delta$ (for some small $\delta$) no classifier’s error rate $\epsilon(f)$ (among all classifiers in the class
$\mathcal{F}$) will be misestimated by more than some small amount $\alpha$. Clearly, we cannot make
such statements for all model classes $\mathcal{F}$. Recall the class of memorization machines that
always achieve empirical error 0 but never outperform random guessing on the underlying
population.


In a sense the class of memorizers is too flexible. No such uniform convergence result
could possibly hold. On the other hand, a fixed classifier is useless—it generalizes perfectly, 
but fits neither the training data nor the test data. The central question of learning has 
thus historically been framed as a trade-off between more flexible (higher variance) model 
classes that better fit the training data but risk overfitting, versus more rigid (higher bias) 
model classes that generalize well but risk underfitting. A central question in learning 
theory has been to develop the appropriate mathematical analysis to quantify where a model 
sits along this spectrum, and to provide the associated guarantees. 


In a series of seminal papers, Vapnik and Chervonenkis extended the theory on the
convergence of relative frequencies to more general classes of functions (Vapnik and
Chervonenkis, 1964, 1968, 1971, 1974, 1981, 1991). One of the key contributions of this line of
work is the Vapnik-Chervonenkis (VC)
dimension, which measures (one notion of) the complexity (flexibility) of a model class.
Moreover, one of their key results bounds the difference between the empirical error and
the population error as a function of the VC dimension and the number of samples:


$$P\left(R[p, f] - R_\mathrm{emp}[\mathbf{X}, \mathbf{Y}, f] < \alpha\right) \geq 1 - \delta \ \text{ for } \ \alpha \geq c \sqrt{(\mathrm{VC} - \log \delta)/n}. \tag{4.6.4}$$


Here $\delta > 0$ is the probability that the bound is violated, $\alpha$ is the upper bound on the
generalization gap, and $n$ is the dataset size. Lastly, $c > 0$ is a constant that depends only
on the scale of the loss that can be incurred. One use of the bound might be to plug in
desired values of $\delta$ and $\alpha$ to determine how many samples to collect. The VC dimension
quantifies the largest number of data points for which we can assign any arbitrary (binary)
labeling and for each find some model $f$ in the class that agrees with that labeling. For
example, linear models on $d$-dimensional inputs have VC dimension $d + 1$. It is easy to
see that a line can assign any possible labeling to three points in two dimensions, but not
to four. Unfortunately, the theory tends to be overly pessimistic for more complex models
and obtaining this guarantee typically requires far more examples than are actually needed
to achieve the desired error rate. Note also that fixing the model class and $\delta$, our error rate
again decays with the usual $\mathcal{O}(1/\sqrt{n})$ rate. It seems unlikely that we could do better in
terms of $n$. However, as we vary the model class, VC dimension can present a pessimistic
picture of the generalization gap.
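

The claim about three and four points in the plane can be probed numerically. The sketch
below is a rough check rather than a proof: it enumerates every binary labeling of a point
set and runs the perceptron algorithm to look for a separating line. The point sets, the
epoch budget, and the function names are hypothetical choices. A `False` result only means
that no separator was found within the budget, and checking one particular set of four
points does not by itself show that no set of four points can be shattered.

```python
from itertools import product


def linearly_separable(points, labels, max_epochs=1000):
    """Search for a separating line with the perceptron algorithm.

    Returns True if every point is strictly on the correct side within
    max_epochs passes, and False if the budget is exhausted.
    """
    w = [0.0, 0.0, 0.0]  # weights for (x1, x2) plus a bias term
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            score = w[0] * x1 + w[1] * x2 + w[2]
            if y * score <= 0:  # misclassified or on the boundary
                w[0] += y * x1
                w[1] += y * x2
                w[2] += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False


def shattered(points):
    """True if a separating line is found for every labeling of the points."""
    return all(linearly_separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))


three_points = [(0, 0), (1, 0), (0, 1)]
four_points = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(shattered(three_points), shattered(four_points))
```

For the four corners of the unit square, the labeling that fails is the XOR pattern, in which
diagonally opposite corners share a label.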


4.6.4 Summary 


The most straightforward way to evaluate a model is to consult a test set comprised of
previously unseen data. Test set evaluations provide an unbiased estimate of the true error
and converge at the desired $\mathcal{O}(1/\sqrt{n})$ rate as the test set grows. We can provide
approximate confidence intervals based on exact asymptotic distributions or valid finite sample
confidence intervals based on (more conservative) finite sample guarantees. Indeed test
set evaluation is the bedrock of modern machine learning research. However, test sets are
seldom true test sets, since they are used by multiple researchers again and again. Once the
same test set is used to evaluate multiple models, controlling for false discovery can be
difficult. This can cause huge problems in theory. In practice, the significance of the problem
depends on the size of the holdout sets in question and whether they are merely being used to
choose hyperparameters or if they are leaking information more directly. Nevertheless, it is good
practice to curate real test sets (or several of them) and to be as conservative as possible
about how often they are used.


Hoping to provide a more satisfying solution, statistical learning theorists have developed
methods for guaranteeing uniform convergence over a model class. If indeed every model’s
empirical error simultaneously converges to its true error, then we are free to choose the
model that performs best, minimizing the training error, knowing that it too will perform
similarly well on the holdout data. Crucially, any such result must depend on some
property of the model class. Vladimir Vapnik and Alexey Chervonenkis introduced the VC
dimension, presenting uniform convergence results that hold for all models in a VC class.
The training errors for all models in the class are (simultaneously) guaranteed to be close
to their true errors, and guaranteed to grow even closer at $\mathcal{O}(1/\sqrt{n})$ rates. Following the
revolutionary discovery of VC dimension, numerous alternative complexity measures have
been proposed, each facilitating an analogous generalization guarantee. See Boucheron
et al. (2005) for a detailed discussion of several advanced ways of measuring function
complexity. Unfortunately, while these complexity measures have become broadly useful
tools in statistical theory, they turn out to be powerless (as straightforwardly applied) for
explaining why deep neural networks generalize. Deep neural networks often have millions
of parameters (or more), and can easily assign random labels to large collections of points.
Nevertheless, they generalize well on practical problems and, surprisingly, they often
generalize better when they are larger and deeper, despite incurring higher VC dimensions. In
the next chapter, we will revisit generalization in the context of deep learning.


4.6.5 Exercises 


1. If we wish to estimate the error of a fixed model f to within 0.0001 with probability 
greater than 99.9%, how many samples do we need? 


2. Suppose that somebody else possesses a labeled test set $\mathcal{D}$ and only makes available the
unlabeled inputs (features). Now suppose that you can only access the test set labels by
running a model $f$ (with no restrictions placed on the model class) on each of the
unlabeled inputs and receiving the corresponding error $\epsilon_\mathcal{D}(f)$. How many models would
you need to evaluate before you leak the entire test set and thus could appear to have
error 0, regardless of your true error?


3. What is the VC dimension of the class of fifth-order polynomials? 


4. What is the VC dimension of axis-aligned rectangles on two-dimensional data? 




4.7 Environment and Distribution Shift 


In the previous sections, we worked through a number of hands-on applications of machine 
learning, fitting models to a variety of datasets. And yet, we never stopped to contemplate 
either where data came from in the first place or what we ultimately plan to do with the 
outputs from our models. Too often, machine learning developers in possession of data 
rush to develop models without pausing to consider these fundamental issues. 


Many failed machine learning deployments can be traced back to this failure. Sometimes 
models appear to perform marvelously as measured by test set accuracy but fail catastroph- 
ically in deployment when the distribution of data suddenly shifts. More insidiously, some- 
times the very deployment of a model can be the catalyst that perturbs the data distribution. 
Say, for example, that we trained a model to predict who will repay rather than default on a 
loan, finding that an applicant’s choice of footwear was associated with the risk of default 
(Oxfords indicate repayment, sneakers indicate default). We might be inclined thereafter 
to grant a loan to any applicant wearing Oxfords and to deny all applicants wearing sneak- 
ers. 


In this case, our ill-considered leap from pattern recognition to decision-making and our 
failure to critically consider the environment might have disastrous consequences. For 
starters, as soon as we began making decisions based on footwear, customers would catch 
on and change their behavior. Before long, all applicants would be wearing Oxfords, with- 
out any coincident improvement in credit-worthiness. Take a minute to digest this because 
similar issues abound in many applications of machine learning: by introducing our
model-based decisions to the environment, we might break the model.


While we cannot possibly give these topics a complete treatment in one section, we aim here 
to expose some common concerns, and to stimulate the critical thinking required to detect 
such situations early, mitigate damage, and use machine learning responsibly. Some of the 
solutions are simple (ask for the “right” data), some are technically difficult (implement a 
reinforcement learning system), and others require that we step outside the realm of sta- 
tistical prediction altogether and grapple with difficult philosophical questions concerning 
the ethical application of algorithms. 


4.7.1 Types of Distribution Shift 


To begin, we stick with the passive prediction setting, considering the various ways that data
distributions might shift and what might be done to salvage model performance. In one
classic setup, we assume that our training data was sampled from some distribution $p_S(\mathbf{x}, y)$
but that our test data will consist of unlabeled examples drawn from some different
distribution $p_T(\mathbf{x}, y)$. Already, we must confront a sobering reality. Absent any assumptions on
how $p_S$ and $p_T$ relate to each other, learning a robust classifier is impossible.


Consider a binary classification problem, where we wish to distinguish between dogs and
cats. If the distribution can shift in arbitrary ways, then our setup permits the pathological
case in which the distribution over inputs remains constant, $p_S(\mathbf{x}) = p_T(\mathbf{x})$, but the labels
are all flipped: $p_S(y \mid \mathbf{x}) = 1 - p_T(y \mid \mathbf{x})$. In other words, if God can suddenly decide that
in the future all “cats” are now dogs and what we previously called “dogs” are now cats,
without any change in the distribution of inputs $p(\mathbf{x})$, then we cannot possibly distinguish
this setting from one in which the distribution did not change at all.


Fortunately, under some restricted assumptions on the ways our data might change in the fu- 
ture, principled algorithms can detect shift and sometimes even adapt on the fly, improving 
on the accuracy of the original classifier. 


Covariate Shift 


Among categories of distribution shift, covariate shift may be the most widely studied. 
Here, we assume that while the distribution of inputs may change over time, the labeling 
function, i.e., the conditional distribution P(y | x) does not change. Statisticians call this 
covariate shift because the problem arises due to a shift in the distribution of the covari- 
ates (features). While we can sometimes reason about distribution shift without invoking 
causality, we note that covariate shift is the natural assumption to invoke in settings where 
we believe that x causes y. 


Consider the challenge of distinguishing cats and dogs. Our training data might consist of
images of the kind in Fig. 4.7.1.


Fig. 4.7.1: Training data for distinguishing cats and dogs; panels labeled cat, cat, dog, dog
(illustrations: Lafeez Hossain / 500px / Getty Images; ilkermetinkursova / iStock / Getty
Images Plus; GlobalP / iStock / Getty Images Plus; Musthafa Aboobakuru / 500px / Getty Images).


At test time we are asked to classify the images in Fig. 4.7.2.


Fig. 4.7.2: Test data for distinguishing cats and dogs; panels labeled cat, cat, dog, dog
(illustrations: SIBAS_minich / iStock / Getty Images Plus; Ghrzuzudu / iStock / Getty Images
Plus; id-work / Digital Vision Vectors / Getty Images; Yime / iStock / Getty Images Plus).


The training set consists of photos, while the test set contains only cartoons. Training on a
dataset with substantially different characteristics from the test set can spell trouble absent
a coherent plan for how to adapt to the new domain.


Label Shift 


Label shift describes the converse problem. Here, we assume that the label marginal P(y) 
can change but the class-conditional distribution P(x | y) remains fixed across domains. 
Label shift is a reasonable assumption to make when we believe that y causes x. For
example, we may want to predict diagnoses given their symptoms (or other manifestations),
even as the relative prevalence of diagnoses is changing over time. Label shift is the
appropriate assumption here because diseases cause symptoms. In some degenerate cases the
label shift and covariate shift assumptions can hold simultaneously. For example, when the 
label is deterministic, the covariate shift assumption will be satisfied, even when y causes 
x. Interestingly, in these cases, it is often advantageous to work with methods that flow 
from the label shift assumption. That is because these methods tend to involve manipulat- 
ing objects that look like labels (often low-dimensional), as opposed to objects that look 
like inputs, which tend to be high-dimensional in deep learning. 
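

The diagnosis example can be made concrete with Bayes' rule. In the sketch below, all
numbers and the function name `posterior_disease` are hypothetical and chosen only for
illustration. The class-conditional probabilities of observing a symptom are held fixed, as
the label shift assumption requires, while the prevalence of the disease changes. The
probability of disease given the symptom then changes substantially, which is why a
classifier fit under one prevalence can be badly calibrated under another.

```python
def posterior_disease(prior, p_symptom_given_disease=0.9, p_symptom_given_healthy=0.1):
    """P(disease | symptom) by Bayes' rule, with fixed class-conditionals P(x | y)."""
    joint_disease = prior * p_symptom_given_disease
    joint_healthy = (1 - prior) * p_symptom_given_healthy
    return joint_disease / (joint_disease + joint_healthy)


rare = posterior_disease(prior=0.01)    # disease affects 1% of the population
common = posterior_disease(prior=0.20)  # prevalence has risen to 20%
print(f"P(disease | symptom) at 1% prevalence:  {rare:.3f}")
print(f"P(disease | symptom) at 20% prevalence: {common:.3f}")
```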


Concept Shift


We may also encounter the related problem of concept shift, which arises when the very 
definitions of labels can change. This sounds weird—a cat is a cat, no? However, other 
categories are subject to changes in usage over time. Diagnostic criteria for mental illness,
what passes for fashionable, and job titles are all subject to considerable amounts of
concept shift. It turns out that if we navigate around the United States, shifting the source of
our data by geography, we will find considerable concept shift regarding the distribution of 
names for soft drinks as shown in Fig. 4.7.3. 


Fig. 4.7.3: Concept shift for soft drink names in the United States (CC-BY: Alan McConchie,
PopVsSoda.com). Map legend: pop, coke, other, no data.


If we were to build a machine translation system, the distribution P(y | x) might be dif- 
ferent depending on our location. This problem can be tricky to spot. We might hope to 
exploit knowledge that shift only takes place gradually either in a temporal or geographic 
sense. 


4.7.2 Examples of Distribution Shift 


Before delving into formalism and algorithms, we can discuss some concrete situations 
where covariate or concept shift might not be obvious. 


Medical Diagnostics 


Imagine that you want to design an algorithm to detect cancer. You collect data from healthy
and sick people and you train your algorithm. It works fine, giving you high accuracy, and
you conclude that you are ready for a successful career in medical diagnostics. Not so
fast.


The distributions that gave rise to the training data and those you will encounter in the wild
might differ considerably. This happened to an unfortunate startup that some of us authors
worked with years ago. They were developing a blood test for a disease that predominantly
affects older men and hoped to study it using blood samples that they had collected from 
patients. However, it is considerably more difficult to obtain blood samples from healthy 
men than from sick patients already in the system. To compensate, the startup solicited 
blood donations from students on a university campus to serve as healthy controls in de- 
veloping their test. Then they asked whether we could help them to build a classifier for 
detecting the disease. 


As we explained to them, it would indeed be easy to distinguish between the healthy and 
sick cohorts with near-perfect accuracy. However, that is because the test subjects differed 
in age, hormone levels, physical activity, diet, alcohol consumption, and many more fac- 
tors unrelated to the disease. This was unlikely to be the case with real patients. Due to 
their sampling procedure, we could expect to encounter extreme covariate shift. Moreover, 
this case was unlikely to be correctable via conventional methods. In short, they wasted a 
significant sum of money. 


Self-Driving Cars 


Say a company wanted to leverage machine learning for developing self-driving cars. One 
key component here is a roadside detector. Since real annotated data is expensive to get, 
they had the (smart and questionable) idea to use synthetic data from a game rendering 
engine as additional training data. This worked really well on “test data” drawn from the 
rendering engine. Alas, inside a real car it was a disaster. As it turned out, the roadside had
been rendered with a very simplistic texture. More importantly, all of the roadside had been
rendered with the same texture, and the roadside detector learned about this “feature” very
quickly.


A similar thing happened to the US Army when they first tried to detect tanks in the forest. 
They took aerial photographs of the forest without tanks, then drove the tanks into the forest 
and took another set of pictures. The classifier appeared to work perfectly. Unfortunately, it 
had merely learned how to distinguish trees with shadows from trees without shadows—the 
first set of pictures was taken in the early morning, the second set at noon. 


Nonstationary Distributions 


A much more subtle situation arises when the distribution changes slowly (also known
as a nonstationary distribution) and the model is not updated adequately. Below are some
typical cases.


• We train a computational advertising model and then fail to update it frequently (e.g., we
forget to incorporate that an obscure new device called an iPad was just launched).


• We build a spam filter. It works well at detecting all spam that we have seen so far. But
then the spammers wise up and craft new messages that look unlike anything we have
seen before.


• We build a product recommendation system. It works throughout the winter but then
continues to recommend Santa hats long after Christmas.


More Anecdotes 


• We build a face detector. It works well on all benchmarks. Unfortunately it fails on test
data: the offending examples are close-ups where the face fills the entire image (no
such data was in the training set).


• We build a web search engine for the US market and want to deploy it in the UK.


• We train an image classifier by compiling a large dataset where each among a large set
of classes is equally represented in the dataset, say 1000 categories, represented by
1000 images each. Then we deploy the system in the real world, where the actual
label distribution of photographs is decidedly non-uniform.


4.7.3 Correction of Distribution Shift 


As we have discussed, there are many cases where training and test distributions P(x, y) 
are different. In some cases, we get lucky and the models work despite covariate, label, 
or concept shift. In other cases, we can do better by employing principled strategies to 
cope with the shift. The remainder of this section grows considerably more technical. The
impatient reader could continue on to the next section as this material is not a prerequisite
to subsequent concepts.


Empirical Risk and Risk 


Let’s first reflect on what exactly is happening during model training: we iterate over
features and associated labels of training data $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ and update the
parameters of a model $f$ after every minibatch. For simplicity we do not consider regularization,
so we largely minimize the loss on the training set:


$$\mathop{\mathrm{minimize}}_f \frac{1}{n} \sum_{i=1}^n l(f(\mathbf{x}_i), y_i), \tag{4.7.1}$$


where $l$ is the loss function measuring “how bad” the prediction $f(\mathbf{x}_i)$ is given the
associated label $y_i$. Statisticians call the term in (4.7.1) empirical risk. The empirical risk is an
average loss over the training data for approximating the risk, which is the expectation of
the loss over the entire population of data drawn from their true distribution $p(\mathbf{x}, y)$:


$$E_{p(\mathbf{x}, y)} [l(f(\mathbf{x}), y)] = \int\int l(f(\mathbf{x}), y) \, p(\mathbf{x}, y) \, d\mathbf{x} \, dy. \tag{4.7.2}$$


However, in practice we typically cannot obtain the entire population of data. Thus, em- 
pirical risk minimization, which is minimizing the empirical risk in (4.7.1), is a practical 
strategy for machine learning, with the hope of approximately minimizing the risk. 




Covariate Shift Correction 


Assume that we want to estimate some dependency $P(y \mid \mathbf{x})$ for which we have labeled data
$(\mathbf{x}_i, y_i)$. Unfortunately, the observations $\mathbf{x}_i$ are drawn from some source distribution $q(\mathbf{x})$
rather than the target distribution $p(\mathbf{x})$. Fortunately, the dependency assumption means
that the conditional distribution does not change: $p(y \mid \mathbf{x}) = q(y \mid \mathbf{x})$. If the source
distribution $q(\mathbf{x})$ is “wrong”, we can correct for that by using the following simple identity
in the risk:


$$\int\int l(f(\mathbf{x}), y) \, p(y \mid \mathbf{x}) \, p(\mathbf{x}) \, d\mathbf{x} \, dy = \int\int l(f(\mathbf{x}), y) \, q(y \mid \mathbf{x}) \, q(\mathbf{x}) \frac{p(\mathbf{x})}{q(\mathbf{x})} \, d\mathbf{x} \, dy. \tag{4.7.3}$$


In other words, we need to reweigh each data example by the ratio of the probability that it
would have been drawn from the correct distribution to that from the wrong one:


$$\beta_i \stackrel{\mathrm{def}}{=} \frac{p(\mathbf{x}_i)}{q(\mathbf{x}_i)}. \tag{4.7.4}$$


Plugging in the weight $\beta_i$ for each data example $(\mathbf{x}_i, y_i)$ we can train our model using
weighted empirical risk minimization:


$$\mathop{\mathrm{minimize}}_f \frac{1}{n} \sum_{i=1}^n \beta_i \, l(f(\mathbf{x}_i), y_i). \tag{4.7.5}$$
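

The effect of the weights can be illustrated on a toy problem in which, unlike in practice,
both densities are known exactly. Everything in the sketch below is a hypothetical setup:
the source $q$ is a standard normal, the target $p$ is a normal with mean 1, the true label is
positive when $x > 0$, and the fixed classifier always predicts "positive", so its loss is 1
exactly when $x \leq 0$. The sketch evaluates the risk of this fixed classifier rather than
training a model. The unweighted average over source samples estimates the risk under $q$,
whereas the weighted average estimates the risk under $p$.

```python
import math
import random


def normal_pdf(x, mean, std=1.0):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


rng = random.Random(0)
n = 20000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]      # features drawn from the source q
losses = [1.0 if x <= 0 else 0.0 for x in xs]      # 0-1 loss of "always predict positive"
betas = [normal_pdf(x, 1.0) / normal_pdf(x, 0.0)   # importance weights p(x) / q(x)
         for x in xs]

unweighted_risk = sum(losses) / n
weighted_risk = sum(b * l for b, l in zip(betas, losses)) / n

# Exact risk under the target p = N(1, 1): P(x <= 0) = Phi(-1).
target_risk = 0.5 * (1 + math.erf(-1 / math.sqrt(2)))
print(f"unweighted={unweighted_risk:.3f}  weighted={weighted_risk:.3f}  "
      f"target={target_risk:.3f}")
```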


Alas, we do not know that ratio, so before we can do anything useful we need to estimate 
it. Many methods are available, including some fancy operator-theoretic approaches that 
attempt to recalibrate the expectation operator directly using a minimum-norm or a maxi- 
mum entropy principle. Note that for any such approach, we need samples drawn from both 
distributions—the “true” p, e.g., by access to test data, and the one used for generating the 
training set q (the latter is trivially available). Note however, that we only need features 
x ~ p(x); we do not need to access labels y ~ p(y). 


In this case, there exists a very effective approach that will give almost as good results 
as the original: namely, logistic regression, which is a special case of softmax regression 
(see Section 4.1) for binary classification. This is all that is needed to compute estimated 
probability ratios. We learn a classifier to distinguish between data drawn from p(x) and 
data drawn from q(x). If it is impossible to distinguish between the two distributions then 
it means that the associated instances are equally likely to come from either one of those 
two distributions. On the other hand, any instances that can be well discriminated should 
be significantly overweighted or underweighted accordingly. 


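For simplicity’s sake assume that we have an equal number of instances from both distributions
$p(\mathbf{x})$ and $q(\mathbf{x})$, respectively. Now denote by $z$ labels that are $1$ for data drawn from $p$
and $-1$ for data drawn from $q$. Then the probability in a mixed dataset is given by


$$P(z=1 \mid \mathbf{x}) = \frac{p(\mathbf{x})}{p(\mathbf{x})+q(\mathbf{x})} \text{ and hence } \frac{P(z=1 \mid \mathbf{x})}{P(z=-1 \mid \mathbf{x})} = \frac{p(\mathbf{x})}{q(\mathbf{x})}. \tag{4.7.6}$$
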
Thus, if we use a logistic regression approach, where $P(z=1 \mid \mathbf{x}) = \frac{1}{1+\exp(-h(\mathbf{x}))}$ ($h$ is a
parametrized function), it follows that


$$\beta_i = \frac{1/(1+\exp(-h(\mathbf{x}_i)))}{\exp(-h(\mathbf{x}_i))/(1+\exp(-h(\mathbf{x}_i)))} = \exp(h(\mathbf{x}_i)). \tag{4.7.7}$$


As a result, we need to solve two problems: the first, to distinguish between data drawn
from both distributions, and then a weighted empirical risk minimization problem in (4.7.5)
where we weigh terms by $\beta_i$.


Now we are ready to describe a correction algorithm. Suppose that we have a training set
$\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ and an unlabeled test set $\{\mathbf{u}_1, \ldots, \mathbf{u}_m\}$. For covariate shift, we
assume that $\mathbf{x}_i$ for all $1 \leq i \leq n$ are drawn from some source distribution and $\mathbf{u}_i$ for all
$1 \leq i \leq m$ are drawn from the target distribution. Here is a prototypical algorithm for
correcting covariate shift:


1. Create a binary-classification training set: $\{(\mathbf{x}_1, -1), \ldots, (\mathbf{x}_n, -1), (\mathbf{u}_1, 1), \ldots, (\mathbf{u}_m, 1)\}$.


2. Train a binary classifier using logistic regression to get the function $h$.


3. Weigh training data using $\beta_i = \exp(h(\mathbf{x}_i))$ or better $\beta_i = \min(\exp(h(\mathbf{x}_i)), c)$ for some
constant $c$.


4. Use weights $\beta_i$ for training on $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ in (4.7.5).
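The four steps can be sketched in a few lines of plain Python. This is an illustrative toy, not a reference implementation: it assumes one-dimensional features, a hypothetical source $N(0, 1)$ and target $N(1, 1)$, a linear discriminator $h(x) = wx + b$ fitted by full-batch gradient descent, and labels encoded as 1/0 rather than 1/−1 (which leaves $h$ unchanged). The helper `train_logistic` and the clipping constant `c = 10` are our own choices.

```python
import math
import random

def train_logistic(xs, zs, lr=0.5, steps=300):
    """Fit h(x) = w * x + b by gradient descent on the mean logistic loss.

    Labels z are encoded as 1 (target) and 0 (source) instead of +1 / -1.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, z in zip(xs, zs):
            err = 1 / (1 + math.exp(-(w * x + b))) - z  # P(z=1 | x) - z
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

rng = random.Random(0)
source = [rng.gauss(0, 1) for _ in range(2000)]  # training features x_i ~ q
target = [rng.gauss(1, 1) for _ in range(2000)]  # unlabeled test features u_i ~ p

# Steps 1-2: build the binary dataset and fit the discriminator h.
w, b = train_logistic(source + target, [0] * len(source) + [1] * len(target))

# Step 3: clipped importance weights for the training examples.
c = 10.0
betas = [min(math.exp(w * x + b), c) for x in source]
# Step 4 would pass `betas` to the weighted empirical risk in (4.7.5).
print(f"h(x) = {w:.2f} * x + {b:.2f}  (exact log-ratio: 1.00 * x - 0.50)")
```

For these two Gaussians the exact log-ratio is $x - 0.5$, so the fitted $w$ and $b$ should land near 1 and −0.5. Clipping at $c$ trades a little bias for lower variance when a few training points receive very large weights.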


Note that the above algorithm relies on a crucial assumption. For this scheme to work, we
need that each data example in the target (e.g., test time) distribution had nonzero probability
of occurring at training time. If we find a point where $p(\mathbf{x}) > 0$ but $q(\mathbf{x}) = 0$, then
the corresponding importance weight should be infinity.


Label Shift Correction 


Assume that we are dealing with a classification task with $k$ categories. Using the same
notation in Section 4.7.3, $q$ and $p$ are the source distribution (e.g., training time) and target
distribution (e.g., test time), respectively. Assume that the distribution of labels shifts over
time: $q(y) \neq p(y)$, but the class-conditional distribution stays the same: $q(\mathbf{x} \mid y) = p(\mathbf{x} \mid y)$.
If the source distribution $q(y)$ is “wrong”, we can correct for that according to the
following identity in the risk as defined in (4.7.2):


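$$\int\int l(f(\mathbf{x}), y) p(\mathbf{x} \mid y) p(y) \, d\mathbf{x} \, dy = \int\int l(f(\mathbf{x}), y) q(\mathbf{x} \mid y) q(y) \frac{p(y)}{q(y)} \, d\mathbf{x} \, dy. \tag{4.7.8}$$


Here, our importance weights will correspond to the label likelihood ratios:


$$\beta_i \stackrel{\textrm{def}}{=} \frac{p(y_i)}{q(y_i)}. \tag{4.7.9}$$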


One nice thing about label shift is that if we have a reasonably good model on the source 
distribution, then we can get consistent estimates of these weights without ever having to 
deal with the ambient dimension. In deep learning, the inputs tend to be high-dimensional 
objects like images, while the labels are often simpler objects like categories. 


To estimate the target label distribution, we first take our reasonably good off-the-shelf
classifier (typically trained on the training data) and compute its “confusion” matrix using
the validation set (also from the training distribution). The confusion matrix, $\mathbf{C}$, is simply a
$k \times k$ matrix, where each column corresponds to the label category (ground truth) and each
row corresponds to our model’s predicted category. Each cell’s value $c_{ij}$ is the fraction of
validation examples with true label $j$ for which our model predicted $i$, so that each column
sums to one.


Now, we cannot calculate the confusion matrix on the target data directly because we do
not get to see the labels for the examples that we see in the wild, unless we invest in a
complex real-time annotation pipeline. What we can do, however, is average all of our
model’s predictions at test time together, yielding the mean model outputs $\mu(\hat{\mathbf{y}}) \in \mathbb{R}^k$,
where the $i^\textrm{th}$ element $\mu(\hat{y}_i)$ is the fraction of the total predictions on the test set where our
model predicted $i$.


It turns out that under some mild conditions—if our classifier was reasonably accurate in 
the first place, and if the target data contains only categories that we have seen before, and 
if the label shift assumption holds in the first place (the strongest assumption here)—we 
can estimate the test set label distribution by solving a simple linear system 


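$$\mathbf{C} p(\mathbf{y}) = \mu(\hat{\mathbf{y}}), \tag{4.7.10}$$


because as an estimate $\sum_{j=1}^k c_{ij} p(y_j) = \mu(\hat{y}_i)$ holds for all $1 \leq i \leq k$, where $p(y_j)$ is
the $j^\textrm{th}$ element of the $k$-dimensional label distribution vector $p(\mathbf{y})$. If our classifier is
sufficiently accurate to begin with, then the confusion matrix $\mathbf{C}$ will be invertible, and we
get a solution $p(\mathbf{y}) = \mathbf{C}^{-1} \mu(\hat{\mathbf{y}})$.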


Because we observe the labels on the source data, it is easy to estimate the distribution
$q(y)$. Then, for any training example $i$ with label $y_i$, we can take the ratio of our estimated
$p(y_i)/q(y_i)$ to calculate the weight $\beta_i$, and plug this into weighted empirical risk
minimization in (4.7.5).
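The procedure can be illustrated with a small, self-contained sketch. All numbers below are hypothetical, and `solve_linear_system` is a helper written for this example. We assume that each column of `C` is normalized to sum to one, so that $c_{ij}$ estimates the probability of predicting $i$ given true label $j$; this is the form under which the linear system (4.7.10) holds. The "unknown" target label distribution is used only to simulate the mean predictions `mu`.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Hypothetical 3-class example. Column j holds P(predicted = i | true = j),
# estimated on source validation data, so every column sums to one.
C = [[0.8, 0.1, 0.1],
     [0.1, 0.7, 0.2],
     [0.1, 0.2, 0.7]]
q_y = [0.5, 0.3, 0.2]       # label frequencies observed in the training set
true_p_y = [0.2, 0.3, 0.5]  # unknown in practice; used only to simulate mu

# Mean predictions on the target data: mu_i = sum_j c_ij * p(y_j).
mu = [sum(c * p for c, p in zip(row, true_p_y)) for row in C]

p_hat = solve_linear_system(C, mu)             # estimate of p(y), as in (4.7.10)
weights = [p / q for p, q in zip(p_hat, q_y)]  # beta for each class label
print([round(p, 3) for p in p_hat], [round(w, 3) for w in weights])
```

Each training example then receives the weight of its class. Note that only a $k \times k$ system is solved, regardless of how high-dimensional the inputs are.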


Concept Shift Correction 


Concept shift is much harder to fix in a principled manner. For instance, in a situation where 
suddenly the problem changes from distinguishing cats from dogs to one of distinguishing 
white from black animals, it will be unreasonable to assume that we can do much better 
than just collecting new labels and training from scratch. Fortunately, in practice, such 
extreme shifts are rare. Instead, what usually happens is that the task keeps on changing 
slowly. To make things more concrete, here are some examples: 


• In computational advertising, new products are launched, old products become less popular.
This means that the distribution over ads and their popularity changes gradually
and any click-through rate predictor needs to change gradually with it.


• Traffic camera lenses degrade gradually due to environmental wear, affecting image
quality progressively.


• News content changes gradually (i.e., most of the news remains unchanged but new stories
appear).


In such cases, we can use the same approach that we used for training networks to make
them adapt to the change in the data. In other words, we use the existing network weights 
and simply perform a few update steps with the new data rather than training from scratch. 
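As a rough illustration of why continuing from existing weights helps under gradual change, consider the following sketch. It assumes a hypothetical one-parameter model $y \approx wx$ whose true slope has drifted from 2.0 to 2.2, and compares a handful of gradient steps started from the old weight against the same number of steps started from zero. The helper `sgd_steps` and all constants are illustrative.

```python
import random

def sgd_steps(w, data, lr=0.1, steps=5):
    """Take a few gradient steps on the mean squared error of y ≈ w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

rng = random.Random(0)
# Hypothetical drift: the slope relating x to y moved from 2.0 to 2.2.
new_data = [(x, 2.2 * x) for x in (rng.uniform(-1, 1) for _ in range(200))]

w_old = 2.0                          # weight learned before the shift
w_warm = sgd_steps(w_old, new_data)  # continue from the existing model
w_cold = sgd_steps(0.0, new_data)    # same update budget, but from scratch
print(f"warm start: {w_warm:.3f}, cold start: {w_cold:.3f}, target: 2.200")
```

With the same small update budget, the warm-started weight ends up much closer to the new optimum, because it only has to cover the distance introduced by the shift.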


4.7.4 A Taxonomy of Learning Problems 


Armed with knowledge about how to deal with changes in distributions, we can now con- 
sider some other aspects of machine learning problem formulation. 


Batch Learning 


In batch learning, we have access to training features and labels $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$,
which we use to train a model f(x). Later on, we deploy this model to score new data (x, y) 
drawn from the same distribution. This is the default assumption for any of the problems 
that we discuss here. For instance, we might train a cat detector based on lots of pictures 
of cats and dogs. Once we have trained it, we ship it as part of a smart catdoor computer 
vision system that lets only cats in. This is then installed in a customer’s home and is never 
updated again (barring extreme circumstances). 


Online Learning 


Now imagine that the data $(\mathbf{x}_i, y_i)$ arrives one sample at a time. More specifically, assume
that we first observe $\mathbf{x}_i$, then we need to come up with an estimate $f(\mathbf{x}_i)$. Only once
we have done this do we observe $y_i$ and so receive a reward or incur a loss, given our
decision. Many real problems fall into this category. For example, we need to predict 
tomorrow’s stock price, which allows us to trade based on that estimate and at the end 
of the day we find out whether our estimate made us a profit. In other words, in online 
learning, we have the following cycle where we are continuously improving our model 
given new observations: 


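$$\textrm{model } f_t \longrightarrow \textrm{data } \mathbf{x}_t \longrightarrow \textrm{estimate } f_t(\mathbf{x}_t) \longrightarrow \textrm{observation } y_t \longrightarrow \textrm{loss } l(y_t, f_t(\mathbf{x}_t)) \longrightarrow \textrm{model } f_{t+1} \tag{4.7.11}$$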
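The cycle in (4.7.11) maps directly onto a loop. The sketch below is a toy under stated assumptions: a one-parameter model $f_t(x) = w_t x$, squared loss, a single gradient step per observation, and a hypothetical noise-free stream with $y = 3x$. The function name `online_learning` is our own.

```python
import random

def online_learning(stream, lr=0.1):
    """Run the predict-observe-update cycle of (4.7.11) for y ≈ w * x."""
    w, losses = 0.0, []
    for x_t, y_t in stream:
        estimate = w * x_t                    # model f_t produces f_t(x_t)
        loss = (estimate - y_t) ** 2          # y_t is revealed only afterwards
        losses.append(loss)
        w -= lr * 2 * (estimate - y_t) * x_t  # update to obtain f_{t+1}
    return w, losses

rng = random.Random(0)
# Hypothetical stream: y = 3x with x ~ Uniform(-1, 1), no noise.
stream = [(x, 3.0 * x) for x in (rng.uniform(-1, 1) for _ in range(2000))]
w_final, losses = online_learning(stream)
print(f"final w: {w_final:.4f}, early loss: {sum(losses[:100]):.3f}, "
      f"late loss: {sum(losses[-100:]):.3g}")
```

The key point is the ordering: each loss is incurred with the model as it stood before seeing $y_t$, so the cumulative loss measures how quickly the learner adapts.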


Bandits 


Bandits are a special case of the problem above. While in most learning problems we have 
a continuously parametrized function f where we want to learn its parameters (e.g., a deep 
network), in a bandit problem we only have a finite number of arms that we can pull, i.e., 
a finite number of actions that we can take. It is not very surprising that for this simpler 
problem stronger theoretical guarantees in terms of optimality can be obtained. We list 
it mainly since this problem is often (confusingly) treated as if it were a distinct learning 
setting. 




Control 


In many cases the environment remembers what we did. It need not do so in an adversarial
manner; it simply remembers, and its response depends on what happened before.
For instance, a coffee boiler controller will observe different temperatures depending on 
whether it was heating the boiler previously. PID (proportional-integral-derivative) con- 
troller algorithms are a popular choice there. Likewise, a user’s behavior on a news site 
will depend on what we showed them previously (e.g., they will read most news only once). 
Many such algorithms form a model of the environment in which they act so as to make 
their decisions appear less random. Recently, control theory (e.g., PID variants) has also 
been used to automatically tune hyperparameters to achieve better disentangling and recon- 
struction quality, and improve the diversity of generated text and the reconstruction quality 
of generated images (Shao et al., 2020). 


Reinforcement Learning 


In the more general case of an environment with memory, we may encounter situations 
where the environment is trying to cooperate with us (cooperative games, in particular 
for non-zero-sum games), or others where the environment will try to win. Chess, Go, 
Backgammon, or StarCraft are some of the cases in reinforcement learning. Likewise, we 
might want to build a good controller for autonomous cars. Other cars are likely to respond 
to the autonomous car’s driving style in nontrivial ways, e.g., trying to avoid it, trying to 
cause an accident, or trying to cooperate with it. 


Considering the Environment 


One key distinction between the different situations above is that a strategy that might have 
worked throughout in the case of a stationary environment, might not work throughout in 
an environment that can adapt. For instance, an arbitrage opportunity discovered by a trader 
is likely to disappear once it is exploited. The speed and manner in which the environment
changes determine to a large extent the type of algorithms that we can bring to bear. For
instance, if we know that things may only change slowly, we can force any estimate to 
change only slowly, too. If we know that the environment might change instantaneously, 
but only very infrequently, we can make allowances for that. These types of knowledge are 
crucial for the aspiring data scientist in dealing with concept shift, i.e., when the problem 
that is being solved can change over time. 


4.7.5 Fairness, Accountability, and Transparency in Machine 
Learning 


Finally, it is important to remember that when you deploy machine learning systems you 
are not merely optimizing a predictive model—you are typically providing a tool that will 
be used to (partially or fully) automate decisions. These technical systems can impact the 
lives of individuals who are subject to the resulting decisions. The leap from considering 
predictions to making decisions raises not only new technical questions, but also a slew of
ethical questions that must be carefully considered. If we are deploying a medical diagnostic
system, we need to know for which populations it may work and for which it may not.
Overlooking foreseeable risks to the welfare of a subpopulation could cause us to adminis- 
ter inferior care. Moreover, once we contemplate decision-making systems, we must step 
back and reconsider how we evaluate our technology. Among other consequences of this 
change of scope, we will find that accuracy is seldom the right measure. For instance, when 
translating predictions into actions, we will often want to take into account the potential cost 
sensitivity of erring in various ways. If one way of misclassifying an image could be perceived
as a racial slight, while misclassification to a different category would be
harmless, then we might want to adjust our thresholds accordingly, accounting for societal 
values in designing the decision-making protocol. We also want to be careful about how 
prediction systems can lead to feedback loops. For example, consider predictive policing 
systems, which allocate patrol officers to areas with high forecasted crime. It is easy to see 
how a worrying pattern can emerge: 


1. Neighborhoods with more crime get more patrols. 


2. Consequently, more crimes are discovered in these neighborhoods, entering the training 
data available for future iterations. 


3. Exposed to more positives, the model predicts yet more crime in these neighborhoods. 


4. In the next iteration, the updated model targets the same neighborhood even more heav- 
ily leading to yet more crimes discovered, etc. 


Often, the various mechanisms by which a model’s predictions become coupled to its train- 
ing data are unaccounted for in the modeling process. This can lead to what researchers 
call runaway feedback loops. Additionally, we want to be careful about whether we are 
addressing the right problem in the first place. Predictive algorithms now play an outsize 
role in mediating the dissemination of information. Should the news that an individual en- 
counters be determined by the set of Facebook pages they have Liked? These are just a few 
among the many pressing ethical dilemmas that you might encounter in a career in machine 
learning. 


4.7.6 Summary 


In many cases training and test sets do not come from the same distribution. This is called 
distribution shift. The risk is the expectation of the loss over the entire population of data 
drawn from their true distribution. However, this entire population is usually unavailable. 
Empirical risk is an average loss over the training data to approximate the risk. In practice, 
we perform empirical risk minimization. 


Under the corresponding assumptions, covariate and label shift can be detected and cor- 
rected for at test time. Failure to account for this bias can become problematic at test time. 
In some cases, the environment may remember automated actions and respond in surprising 
ways. We must account for this possibility when building models and continue to moni- 
tor live systems, open to the possibility that our models and the environment will become 
entangled in unanticipated ways.


4.7.7 Exercises

1. What could happen when we change the behavior of a search engine? What might the 
users do? What about the advertisers? 
2. Implement a covariate shift detector. Hint: build a classifier. 


3. Implement a covariate shift corrector. 


4. Besides distribution shift, what else could affect how the empirical risk approximates 
the risk? 


In this chapter, we will introduce your first truly deep network. The simplest deep networks
are called multilayer perceptrons, and they consist of multiple layers of neurons each fully 
connected to those in the layer below (from which they receive input) and those above 
(which they, in turn, influence). Although automatic differentiation significantly simplifies 
the implementation of deep learning algorithms, we will dive deep into how these gradi- 
ents are calculated in deep networks. Then we will be ready to discuss issues relating to 
numerical stability and parameter initialization that are key to successfully training deep 
networks. When we train such high-capacity models we run the risk of overfitting. Thus, 
we will revisit regularization and generalization for deep networks. Throughout, we aim to 
give you a firm grasp not just of the concepts but also of the practice of using deep networks. 
At the end of this chapter, we apply what we have introduced so far to a real case: house 
price prediction. We punt matters relating to the computational performance, scalability, 
and efficiency of our models to subsequent chapters. 


5.1 Multilayer Perceptrons 


In Section 4.1, we introduced softmax regression, implementing the algorithm from scratch 
(Section 4.4) and using high-level APIs (Section 4.5). This allowed us to train classifiers ca- 
pable of recognizing 10 categories of clothing from low-resolution images. Along the way, 
we learned how to wrangle data, coerce our outputs into a valid probability distribution, 
apply an appropriate loss function, and minimize it with respect to our model’s parameters. 
Now that we have mastered these mechanics in the context of simple linear models, we 
can launch our exploration of deep neural networks, the comparatively rich class of models 
with which this book is primarily concerned. 


%matplotlib inline
import torch
from d2l import torch as d2l


5.1.1 Hidden Layers 


We described affine transformations in Section 3.1.1 as linear transformations with added 
bias. To begin, recall the model architecture corresponding to our softmax regression
example, illustrated in Fig. 4.1.1. This model maps inputs directly to outputs via a single
affine transformation, followed by a softmax operation. If our labels truly were related to 
the input data by a simple affine transformation, then this approach would be sufficient. 
However, linearity (in affine transformations) is a strong assumption. 


Limitations of Linear Models 


For example, linearity implies the weaker assumption of monotonicity, i.e., that any in- 
crease in our feature must either always cause an increase in our model’s output (if the 
corresponding weight is positive), or always cause a decrease in our model’s output (if 
the corresponding weight is negative). Sometimes that makes sense. For example, if we 
were trying to predict whether an individual will repay a loan, we might reasonably assume 
that all other things being equal, an applicant with a higher income would always be more 
likely to repay than one with a lower income. While monotonic, this relationship likely 
is not linearly associated with the probability of repayment. An increase in income from 
$0 to $50,000 likely corresponds to a bigger increase in likelihood of repayment than an 
increase from $1 million to $1.05 million. One way to handle this might be to postprocess 
our outcome such that linearity becomes more plausible, by using the logistic map (and 
thus the logarithm of the probability of outcome). 


Note that we can easily come up with examples that violate monotonicity. Say for example 
that we want to predict health as a function of body temperature. For individuals with a
body temperature above the normal 37°C (98.6°F), higher temperatures indicate greater risk.
However, if the body temperature drops below 37°C, lower temperatures indicate greater
risk! Again, we might resolve the problem with some clever preprocessing, such as using 
the distance from 37°C as a feature. 


But what about classifying images of cats and dogs? Should increasing the intensity of the 
pixel at location (13, 17) always increase (or always decrease) the likelihood that the image 
depicts a dog? Reliance on a linear model corresponds to the implicit assumption that the 
only requirement for differentiating cats and dogs is to assess the brightness of individual 
pixels. This approach is doomed to fail in a world where inverting an image preserves the 
category. 


And yet despite the apparent absurdity of linearity here, as compared with our previous 
examples, it is less obvious that we could address the problem with a simple preprocessing 
fix. That is because the significance of any pixel depends in complex ways on its context
(the values of the surrounding pixels). While there might exist a representation of our data
that would take into account the relevant interactions among our features, on top of which
a linear model would be suitable, we simply do not know how to calculate it by hand. With
deep neural networks, we use observational data to jointly learn both a representation via
hidden layers and a linear predictor that acts upon that representation. 


This problem of nonlinearity has been studied for at least a century (Fisher, 1925). For 
instance, decision trees in their most basic form use a sequence of binary decisions to de- 
cide upon class membership (Quinlan, 1993). Likewise, kernel methods have been used 
for many decades to model nonlinear dependencies (Aronszajn, 1950). This has found its
way into nonparametric spline models (Wahba, 1990) and kernel methods (Schölkopf and
Smola, 2002). It is also something that the brain solves quite naturally. After all, neu- 
rons feed into other neurons which, in turn, feed into other neurons again (Ramón y Cajal 
and Azoulay, 1894). Consequently we have a sequence of relatively simple transforma- 
tions. 


Incorporating Hidden Layers 


We can overcome the limitations of linear models by incorporating one or more hidden 
layers. The easiest way to do this is to stack many fully connected layers on top of one 
another. Each layer feeds into the layer above it, until we generate outputs. We can think of 
the first L — 1 layers as our representation and the final layer as our linear predictor. This 
architecture is commonly called a multilayer perceptron, often abbreviated as MLP (Fig. 
5.1.1). 


Fig. 5.1.1: An MLP with a hidden layer of five hidden units.


This MLP has four inputs, three outputs, and its hidden layer contains five hidden units. 
Since the input layer does not involve any calculations, producing outputs with this network 
requires implementing the computations for both the hidden and output layers; thus, the 
number of layers in this MLP is two. Note that both layers are fully connected. Every 
input influences every neuron in the hidden layer, and each of these in turn influences every 
neuron in the output layer. Alas, we are not quite done yet. 


From Linear to Nonlinear 


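As before, we denote by the matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ a minibatch of $n$ examples where each example
has $d$ inputs (features). For a one-hidden-layer MLP whose hidden layer has $h$ hidden
units, we denote by $\mathbf{H} \in \mathbb{R}^{n \times h}$ the outputs of the hidden layer, which are hidden representations.
Since the hidden and output layers are both fully connected, we have hidden-layer
weights $\mathbf{W}^{(1)} \in \mathbb{R}^{d \times h}$ and biases $\mathbf{b}^{(1)} \in \mathbb{R}^{1 \times h}$ and output-layer weights $\mathbf{W}^{(2)} \in \mathbb{R}^{h \times q}$ and
biases $\mathbf{b}^{(2)} \in \mathbb{R}^{1 \times q}$. This allows us to calculate the outputs $\mathbf{O} \in \mathbb{R}^{n \times q}$ of the
one-hidden-layer MLP as follows:


$$\begin{aligned} \mathbf{H} &= \mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}, \\ \mathbf{O} &= \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}. \end{aligned} \tag{5.1.1}$$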


Note that after adding the hidden layer, our model now requires us to track and update 
additional sets of parameters. So what have we gained in exchange? You might be surprised
to find out that—in the model defined above—we gain nothing for our troubles! The reason
is plain. The hidden units above are given by an affine function of the inputs, and the outputs 
(pre-softmax) are just an affine function of the hidden units. An affine function of an affine 
function is itself an affine function. Moreover, our linear model was already capable of 
representing any affine function. 


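To see this formally we can just collapse out the hidden layer in the above definition, yielding
an equivalent single-layer model with parameters $\mathbf{W} = \mathbf{W}^{(1)} \mathbf{W}^{(2)}$ and
$\mathbf{b} = \mathbf{b}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}$:


$$\mathbf{O} = (\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}) \mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X} \mathbf{W}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X} \mathbf{W} + \mathbf{b}. \tag{5.1.2}$$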
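We can confirm the collapse numerically on a tiny example. The sketch below uses plain Python lists instead of tensors; the helpers `matmul` and `add_bias` and all matrix entries are hypothetical values chosen so that the arithmetic is exact.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_bias(m, bias):
    # Broadcast the 1 x k bias row over every row of m.
    return [[v + c for v, c in zip(row, bias[0])] for row in m]

# Hypothetical sizes: n = 2 examples, d = 3 inputs, h = 2 hidden units, q = 2 outputs.
X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
W1 = [[1.0, -1.0], [0.0, 2.0], [3.0, 1.0]]
b1 = [[1.0, -2.0]]
W2 = [[2.0, 0.0], [1.0, -1.0]]
b2 = [[0.0, 3.0]]

# Two stacked affine layers without an activation function.
two_layer = add_bias(matmul(add_bias(matmul(X, W1), b1), W2), b2)

# Equivalent single layer with W = W1 W2 and b = b1 W2 + b2.
W = matmul(W1, W2)
b = add_bias(matmul(b1, W2), b2)
one_layer = add_bias(matmul(X, W), b)

print(two_layer == one_layer)
```

Both computations give the same outputs, so the extra layer adds parameters without adding expressive power.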


In order to realize the potential of multilayer architectures, we need one more key ingredient:
a nonlinear activation function $\sigma$ to be applied to each hidden unit following the
affine transformation. For instance, a popular choice is the ReLU (rectified linear unit)
activation function (Nair and Hinton, 2010) $\sigma(x) = \max(0, x)$ operating on its arguments
elementwise. The outputs of activation functions $\sigma(\cdot)$ are called activations. In general,
with activation functions in place, it is no longer possible to collapse our MLP into a linear
model:


$$\begin{aligned} \mathbf{H} &= \sigma(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}), \\ \mathbf{O} &= \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}. \end{aligned} \tag{5.1.3}$$


Since each row in X corresponds to an example in the minibatch, with some abuse of 
notation, we define the nonlinearity $\sigma$ to apply to its inputs in a rowwise fashion, i.e., one
example at a time. Note that we used the same notation for softmax when we denoted a 
rowwise operation in Section 4.1.1. Quite frequently the activation functions we use apply 
not merely rowwise but elementwise. That means that after computing the linear portion of 
the layer, we can calculate each activation without looking at the values taken by the other 
hidden units. 
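A very small example shows what the nonlinearity buys us. The sketch below hand-picks the weights of a one-hidden-layer MLP with two ReLU units (an illustrative choice, not a trained model) so that the network computes $|x|$, a function that no affine map can represent.

```python
def relu(v):
    return max(0.0, v)

def mlp(x):
    """One-hidden-layer MLP with two ReLU units: W1 = [1, -1], W2 = [1, 1]^T, zero biases."""
    hidden = [relu(1.0 * x), relu(-1.0 * x)]  # applied elementwise
    return 1.0 * hidden[0] + 1.0 * hidden[1]

outputs = [mlp(x) for x in (-2.0, 0.0, 2.0)]  # equals |x| at each point
# Any affine map f satisfies f(-2) + f(2) == 2 * f(0); this network does not.
is_affine_consistent = outputs[0] + outputs[2] == 2 * outputs[1]
print(outputs, is_affine_consistent)
```

Because each activation depends only on its own pre-activation value, the two hidden units can be computed independently of one another.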


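To build more general MLPs, we can continue stacking such hidden layers, e.g.,
$\mathbf{H}^{(1)} = \sigma_1(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)})$ and $\mathbf{H}^{(2)} = \sigma_2(\mathbf{H}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)})$, one atop another, yielding ever more
expressive models.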


Universal Approximators 


We know that the brain is capable of very sophisticated statistical analysis. As such, it is 
worth asking, just how powerful a deep network could be. This question has been answered 
multiple times, e.g., in Cybenko (1989) in the context of MLPs, and in Micchelli (1984) in 
the context of reproducing kernel Hilbert spaces in a way that could be seen as radial basis 
function (RBF) networks with a single hidden layer. These (and related results) suggest that 
even with a single-hidden-layer network, given enough nodes (possibly absurdly many), 
and the right set of weights, we can model any function. Actually learning that function 
is the hard part, though. You might think of your neural network as being a bit like the 
C programming language. The language, like any other modern language, is capable of
expressing any computable program. But actually coming up with a program that meets
your specifications is the hard part. 


Moreover, just because a single-hidden-layer network can learn any function does not mean 
that you should try to solve all of your problems with one. In fact, in this case kernel 
methods are way more effective, since they are capable of solving the problem exactly even 
in infinite dimensional spaces (Kimeldorf and Wahba, 1971, Schölkopf et al., 2001). In
fact, we can approximate many functions much more compactly by using deeper (rather 
than wider) networks (Simonyan and Zisserman, 2014). We will touch upon more rigorous 
arguments in subsequent chapters. 


5.1.2 Activation Functions 


Activation functions decide whether a neuron should be activated or not by calculating the 
weighted sum and further adding bias to it. They are differentiable operators for trans- 
forming input signals to outputs, while most of them add nonlinearity. Because activation 
functions are fundamental to deep learning, let’s briefly survey some common ones. 


ReLU Function 


The most popular choice, due to both simplicity of implementation and its good perfor- 
mance on a variety of predictive tasks, is the rectified linear unit (ReLU) (Nair and Hinton, 
2010). ReLU provides a very simple nonlinear transformation. Given an element x, the 
function is defined as the maximum of that element and 0: 


ReLU(x) = max(x, 0). (5.1.4) 


Informally, the ReLU function retains only positive elements and discards all negative el- 
ements by setting the corresponding activations to 0. To gain some intuition, we can plot 
the function. As you can see, the activation function is piecewise linear. 


x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.relu(x)
d2l.plot(x.detach(), y.detach(), 'x', 'relu(x)', figsize=(5, 2.5))


[Figure: plot of relu(x) against x.]


When the input is negative, the derivative of the ReLU function is 0, and when the input 
is positive, the derivative of the ReLU function is 1. Note that the ReLU function is not
differentiable when the input takes value precisely equal to 0. In these cases, we default to
the left-hand-side derivative and say that the derivative is 0 when the input is 0. We can 
get away with this because the input may never actually be zero (mathematicians would say 
that it is nondifferentiable on a set of measure zero). There is an old adage that if subtle 
boundary conditions matter, we are probably doing (real) mathematics, not engineering. 
That conventional wisdom may apply here, or at least, the fact that we are not performing 
constrained optimization (Mangasarian, 1965, Rockafellar, 1970). We plot the derivative 
of the ReLU function below. 


y.backward(torch.ones_like(x), retain_graph=True)
d2l.plot(x.detach(), x.grad, 'x', 'grad of relu', figsize=(5, 2.5))


[Figure: plot of the gradient of relu against x.]


The reason for using ReLU is that its derivatives are particularly well behaved: either they 
vanish or they just let the argument through. This makes optimization better behaved and 
it mitigated the well-documented problem of vanishing gradients that plagued previous 
versions of neural networks (more on this later). 


Note that there are many variants to the ReLU function, including the parametrized ReLU 
(pReLU) function (He et al., 2015). This variation adds a linear term to ReLU, so some 
information still gets through, even when the argument is negative: 


$$\operatorname{pReLU}(x) = \max(0, x) + \alpha \min(0, x). \tag{5.1.5}$$
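As a quick illustration, the formula translates into a one-line function. This is a scalar sketch in plain Python, with $\alpha$ fixed by hand at an illustrative value of 0.25 rather than treated as a parameter of the model.

```python
def prelu(x, alpha):
    """pReLU(x) = max(0, x) + alpha * min(0, x), as in (5.1.5)."""
    return max(0.0, x) + alpha * min(0.0, x)

values = [prelu(x, alpha=0.25) for x in (-4.0, -1.0, 0.0, 2.0)]
print(values)
```

Negative inputs are scaled by $\alpha$ instead of being zeroed out, and setting $\alpha = 0$ recovers the ordinary ReLU.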


Sigmoid Function 


The sigmoid function transforms those inputs whose values lie in the domain $\mathbb{R}$, to outputs
that lie on the interval (0, 1). For that reason, the sigmoid is often called a squashing function:
it squashes any input in the range (-inf, inf) to some value in the range (0, 1):


$$\operatorname{sigmoid}(x) = \frac{1}{1 + \exp(-x)}. \tag{5.1.6}$$

In the earliest neural networks, scientists were interested in modeling biological neurons 
that either fire or do not fire. Thus the pioneers of this field, going all the way back to 
McCulloch and Pitts, the inventors of the artificial neuron, focused on thresholding units 
(McCulloch and Pitts, 1943). A thresholding activation takes value 0 when its input is 
below some threshold and value 1 when the input exceeds the threshold.

When attention shifted to gradient-based learning, the sigmoid function was a natural choice
because it is a smooth, differentiable approximation to a thresholding unit. Sigmoids are 
still widely used as activation functions on the output units when we want to interpret the 
outputs as probabilities for binary classification problems: you can think of the sigmoid as a 
special case of the softmax. However, the sigmoid has largely been replaced by the simpler 
and more easily trainable ReLU for most uses in hidden layers. Much of this has to do with
the fact that the sigmoid poses challenges for optimization (LeCun et al., 1998) since its 
gradient vanishes for large positive and negative arguments. This can lead to plateaus that 
are difficult to escape from. Nonetheless sigmoids are important. In later chapters (e.g., 
Section 10.1) on recurrent neural networks, we will describe architectures that leverage 
sigmoid units to control the flow of information across time. 


Below, we plot the sigmoid function. Note that when the input is close to 0, the sigmoid 
function approaches a linear transformation. 


y = torch.sigmoid(x)
d2l.plot(x.detach(), y.detach(), 'x', 'sigmoid(x)', figsize=(5, 2.5))


[Figure: plot of sigmoid(x) against x.]

The derivative of the sigmoid function is given by the following equation:

$$\frac{d}{dx} \operatorname{sigmoid}(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \operatorname{sigmoid}(x)\left(1 - \operatorname{sigmoid}(x)\right). \tag{5.1.7}$$

The derivative of the sigmoid function is plotted below. Note that when the input is 0, the
derivative of the sigmoid function reaches a maximum of 0.25. As the input diverges from
0 in either direction, the derivative approaches 0.


# Clear out previous gradients
x.grad.data.zero_()
y.backward(torch.ones_like(x), retain_graph=True)
d2l.plot(x.detach(), x.grad, 'x', 'grad of sigmoid', figsize=(5, 2.5))
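The closed form in (5.1.7) can also be checked without a deep learning framework. The sketch below, written in plain Python, compares it with a central-difference estimate at a few points; the helper names `sigmoid_grad` and `central_difference` and the step size `eps` are assumptions made for this illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Closed form from (5.1.7): sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def central_difference(f, x, eps=1e-6):
    """Numerical derivative, used only as an independent check."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


points = [-6.0, -2.0, 0.0, 2.0, 6.0]
max_gap = max(abs(sigmoid_grad(x) - central_difference(sigmoid, x)) for x in points)
peak = sigmoid_grad(0.0)
print(peak, max_gap)
```

The value at 0 is the maximum of 0.25 mentioned above, and the derivative is already below 0.01 at an input of 6.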


Tanh Function 


Like the sigmoid function, the tanh (hyperbolic tangent) function also squashes its inputs, 
transforming them into elements on the interval between -1 and 1:

$$\tanh(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}. \tag{5.1.8}$$

We plot the tanh function below. Note that as input nears 0, the tanh function approaches a
linear transformation. Although the shape of the function is similar to that of the sigmoid 
function, the tanh function exhibits point symmetry about the origin of the coordinate sys- 
tem (Kalman and Kwasny, 1992). 


y = torch.tanh(x)
d2l.plot(x.detach(), y.detach(), 'x', 'tanh(x)', figsize=(5, 2.5))


[Figure: plot of tanh(x) against x.]

The derivative of the tanh function is:

$$\frac{d}{dx} \tanh(x) = 1 - \tanh^2(x). \tag{5.1.9}$$


It is plotted below. As the input nears 0, the derivative of the tanh function approaches a 
maximum of 1. And as we saw with the sigmoid function, as input moves away from 0 in 
either direction, the derivative of the tanh function approaches 0. 


# Clear out previous gradients
x.grad.data.zero_()
y.backward(torch.ones_like(x), retain_graph=True)
d2l.plot(x.detach(), x.grad, 'x', 'grad of tanh', figsize=(5, 2.5))


[Figure: plot of grad of tanh against x.]


5.1.3 Summary and Discussion


We now know how to incorporate nonlinearities to build expressive multilayer neural network
architectures. As a side note, your knowledge already puts you in command of a similar
toolkit to a practitioner circa 1990. In some ways, you have an advantage over anyone
working back then, because you can leverage powerful open-source deep learning frameworks
to build models rapidly, using only a few lines of code. Previously, training these
networks required researchers to code up layers and derivatives explicitly in C, Fortran, or 
even Lisp (in the case of LeNet). 


A secondary benefit is that ReLU is significantly more amenable to optimization than the 
sigmoid or the tanh function. One could argue that this was one of the key innovations that 
helped the resurgence of deep learning over the past decade. Note, though, that research in 
activation functions has not stopped. For instance, the GELU (Gaussian error linear unit)
activation function $x \Phi(x)$ by Hendrycks and Gimpel (2016) ($\Phi(x)$ is the standard Gaussian
cumulative distribution function) and the Swish activation function $\sigma(x) = x \operatorname{sigmoid}(\beta x)$
as proposed in Ramachandran et al. (2017) can yield better accuracy in many cases.
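To make the two definitions concrete, here is a small plain-Python sketch that evaluates GELU and Swish next to ReLU. It writes the Gaussian CDF in terms of `math.erf`; the function names and the default `beta=1.0` are illustrative assumptions.

```python
import math


def gelu(x):
    """GELU: x * Phi(x), with Phi the standard Gaussian CDF (via erf)."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x); beta=1.0 is an illustrative default."""
    return x / (1.0 + math.exp(-beta * x))


# Columns: input, ReLU, GELU, Swish.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, max(0.0, x), gelu(x), swish(x))
```

Both functions follow ReLU closely for large positive inputs but, unlike ReLU, take small negative values for moderately negative inputs.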


5.1.4 Exercises 


1. Show that adding layers to a linear deep network, i.e., a network without nonlinearity
$\sigma$, can never increase the expressive power of the network. Give an example where it
actively reduces it. 


2. Compute the derivative of the pReLU activation function. 
3. Compute the derivative of the Swish activation function x sigmoid(x). 


4. Show that an MLP using only ReLU (or pReLU) constructs a continuous piecewise 
linear function. 


5. Sigmoid and tanh are very similar. 
1. Show that tanh(x) + 1 = 2 sigmoid(2x). 


2. Prove that the function classes parametrized by both nonlinearities are identical. 
Hint: affine layers have bias terms, too. 


6. Assume that we have a nonlinearity that applies to one minibatch at a time, such as the 
batch normalization (Ioffe and Szegedy, 2015). What kinds of problems do you expect 
this to cause? 


7. Provide an example where the gradients vanish for the sigmoid activation function. 


Discussions.


5.2 Implementation of Multilayer Perceptrons


Multilayer perceptrons (MLPs) are not much more complex to implement than simple linear 
models. The key conceptual difference is that we now concatenate multiple layers. 


import torch 
from torch import nn 
from d2l import torch as d2l


5.2.1 Implementation from Scratch 


Let’s begin again by implementing such a network from scratch. 


Initializing Model Parameters 


Recall that Fashion-MNIST contains 10 classes, and that each image consists of a 28 x 
28 = 784 grid of grayscale pixel values. As before we will disregard the spatial structure 
among the pixels for now, so we can think of this as a classification dataset with 784 input 
features and 10 classes. To begin, we will implement an MLP with one hidden layer and 256 
hidden units. Both the number of layers and their width are adjustable (they are considered 
hyperparameters). Typically, we choose the layer widths to be divisible by larger powers of 
2. This is computationally efficient due to the way memory is allocated and addressed in 
hardware. 


Again, we will represent our parameters with several tensors. Note that for every layer, we 
must keep track of one weight matrix and one bias vector. As always, we allocate memory 
for the gradients of the loss with respect to these parameters. 


In the code below we use nn.Parameter to automatically register a class attribute as a 
parameter to be tracked by autograd (Section 2.5). 


class MLPScratch(d2l.Classifier):
    def __init__(self, num_inputs, num_outputs, num_hiddens, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.W1 = nn.Parameter(torch.randn(num_inputs, num_hiddens) * sigma)
        self.b1 = nn.Parameter(torch.zeros(num_hiddens))
        self.W2 = nn.Parameter(torch.randn(num_hiddens, num_outputs) * sigma)
        self.b2 = nn.Parameter(torch.zeros(num_outputs))
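The shapes above determine how many scalars the model has to learn. As a quick sanity check, the hypothetical helper `mlp_param_count` below (not part of the book's library) adds up the entries of both weight matrices and both bias vectors for the sizes used in this section.

```python
def mlp_param_count(num_inputs, num_hiddens, num_outputs):
    """Number of scalars in W1, b1, W2, b2 for a one-hidden-layer MLP."""
    hidden = num_inputs * num_hiddens + num_hiddens   # W1 and b1
    output = num_hiddens * num_outputs + num_outputs  # W2 and b2
    return hidden + output


total = mlp_param_count(num_inputs=784, num_hiddens=256, num_outputs=10)
print(total)  # 203530
```

Almost all of these parameters sit in the first weight matrix, which is why the hidden width is the main lever on model size here.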


Model 


To make sure we know how everything works, we will implement the ReLU activation 
ourselves rather than invoking the built-in relu function directly.


def relu(X):
    a = torch.zeros_like(X)
    return torch.max(X, a)


Since we are disregarding spatial structure, we reshape each two-dimensional image into 
a flat vector of length num_inputs. Finally, we implement our model with just a few lines 
of code. Since we use the framework’s built-in autograd, this is all that it takes.


@d2l.add_to_class(MLPScratch)
def forward(self, X):
    X = X.reshape((-1, self.num_inputs))
    H = relu(torch.matmul(X, self.W1) + self.b1)
    return torch.matmul(H, self.W2) + self.b2


Training 


Fortunately, the training loop for MLPs is exactly the same as for softmax regression. 
We define the model, data, and trainer, then finally invoke the fit method on model and 
data. 


model = MLPScratch(num_inputs=784, num_outputs=10, num_hiddens=256, lr=0.1)
data = d2l.FashionMNIST(batch_size=256)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)


[Figure: training curves for train_loss, val_loss, and val_acc.]


5.2.2 Concise Implementation 


As you might expect, by relying on the high-level APIs, we can implement MLPs even 
more concisely. 


Model 


Compared with our concise implementation of softmax regression (Section
4.5), the only difference is that we add two fully connected layers where we previously added
only one. The first is the hidden layer, the second is the output layer.


class MLP(d2l.Classifier):
    def __init__(self, num_outputs, num_hiddens, lr):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(num_hiddens),
                                 nn.ReLU(), nn.LazyLinear(num_outputs))


Previously, we defined forward methods for models to transform input using the model 
parameters. These operations are essentially a pipeline: you take an input and apply a 
transformation (e.g., matrix multiplication with weights followed by bias addition), then 
repetitively use the output of the current transformation as input to the next transforma- 
tion. However, you may have noticed that no forward method is defined here. In fact, 
MLP inherits the forward method from the Module class (Section 3.2.2) to simply invoke 
self.net(X) (X is input), which is now defined as a sequence of transformations via the
Sequential class. The Sequential class abstracts the forward process enabling us to fo- 
cus on the transformations. We will further discuss how the Sequential class works in 
Section 6.1.2. 
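The pipeline idea described above can be mimicked in a few lines of plain Python. The `sequential` function below is a hypothetical toy stand-in, not the framework's Sequential class: it only chains callables, with no parameters, gradients, or lazy shape inference.

```python
def sequential(*layers):
    """Toy stand-in for a Sequential container (hypothetical helper, not the
    framework class): feed the output of each layer into the next one."""
    def forward(x):
        for layer in layers:
            x = layer(x)
        return x
    return forward


# Three illustrative "layers" acting on a list of numbers.
scale = lambda v: [2.0 * a for a in v]
shift = lambda v: [a - 3.0 for a in v]
relu = lambda v: [max(0.0, a) for a in v]

net = sequential(scale, shift, relu)
out = net([1.0, 2.0, 0.5])
print(out)  # [0.0, 1.0, 0.0]
```

Because the container only needs each layer to be callable, the order of the arguments alone fixes the order of computation.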


Training 


The training loop is exactly the same as when we implemented softmax regression. This 
modularity enables us to separate matters concerning the model architecture from orthog- 
onal considerations. 


model = MLP(num_outputs=10, num_hiddens=256, lr=0.1)
trainer.fit(model, data) 


[Figure: training curves for train_loss, val_loss, and val_acc.]


5.2.3 Summary 


Now that we have more practice in designing deep networks, the step from a single to mul- 
tiple layers of deep networks does not pose such a significant challenge any longer. In 
particular, we can reuse the training algorithm and data loader. Note, though, that imple- 
menting MLPs from scratch is nonetheless messy: naming and keeping track of the model 
parameters makes it difficult to extend models. For instance, imagine wanting to insert 
another layer between layers 42 and 43. This might now be layer 42b, unless we are willing
to perform sequential renaming. Moreover, if we implement the network from scratch, it
is much more difficult for the framework to perform meaningful performance optimiza- 
tions. 


Nonetheless, you have now reached the state of the art of the late 1980s when fully con- 
nected deep networks were the method of choice for neural network modeling. Our next 
conceptual step will be to consider images. Before we do so, we need to review a number 
of statistical basics and details on how to compute models efficiently. 


5.2.4 Exercises


1. Change the number of hidden units num_hiddens and plot how its number affects the
   accuracy of the model. What is the best value of this hyperparameter?
2. Try adding a hidden layer to see how it affects the results.
3. Why is it a bad idea to insert a hidden layer with a single neuron? What could go wrong?
4. How does changing the learning rate alter your results? With all other parameters fixed,
   which learning rate gives you the best results? How does this relate to the number of
   epochs?
5. Let’s optimize over all hyperparameters jointly, i.e., learning rate, number of epochs,
   number of hidden layers, and number of hidden units per layer.
   1. What is the best result you can get by optimizing over all of them?
   2. Why is it much more challenging to deal with multiple hyperparameters?
   3. Describe an efficient strategy for optimizing over multiple parameters jointly.
6. Compare the speed of the framework and the from-scratch implementation for a
   challenging problem. How does it change with the complexity of the network?
7. Measure the speed of tensor-matrix multiplications for well-aligned and misaligned
   matrices. For instance, test for matrices with dimension 1024, 1025, 1026, 1028, and
   1032.
   1. How does this change between GPUs and CPUs?
   2. Determine the memory bus width of your CPU and GPU.
8. Try out different activation functions. Which one works best?
9. Is there a difference between weight initializations of the network? Does it matter?


5.3 Forward Propagation, Backward Propagation,
and Computational Graphs


So far, we have trained our models with minibatch stochastic gradient descent. However, 
when we implemented the algorithm, we only worried about the calculations involved in 
forward propagation through the model. When it came time to calculate the gradients, we 
just invoked the backpropagation function provided by the deep learning framework. 


The automatic calculation of gradients profoundly simplifies the implementation of deep 
learning algorithms. Before automatic differentiation, even small changes to complicated 
models required recalculating complicated derivatives by hand. Surprisingly often, aca- 
demic papers had to allocate numerous pages to deriving update rules. While we must 
continue to rely on automatic differentiation so we can focus on the interesting parts, you 
ought to know how these gradients are calculated under the hood if you want to go beyond 
a shallow understanding of deep learning. 


In this section, we take a deep dive into the details of backward propagation (more com- 
monly called backpropagation). To convey some insight for both the techniques and their 
implementations, we rely on some basic mathematics and computational graphs. To start, 
we focus our exposition on a one-hidden-layer MLP with weight decay ($\ell_2$ regularization,
to be described in subsequent chapters). 


5.3.1 Forward Propagation 


Forward propagation (or forward pass) refers to the calculation and storage of intermediate 
variables (including outputs) for a neural network in order from the input layer to the output 
layer. We now work step-by-step through the mechanics of a neural network with one 
hidden layer. This may seem tedious but in the eternal words of funk virtuoso James Brown, 
you must “pay the cost to be the boss”. 


For the sake of simplicity, let’s assume that the input example is $\mathbf{x} \in \mathbb{R}^d$ and that our hidden
layer does not include a bias term. Here the intermediate variable is:

$$\mathbf{z} = \mathbf{W}^{(1)} \mathbf{x}, \tag{5.3.1}$$

where $\mathbf{W}^{(1)} \in \mathbb{R}^{h \times d}$ is the weight parameter of the hidden layer. After running the
intermediate variable $\mathbf{z} \in \mathbb{R}^h$ through the activation function $\phi$ we obtain our hidden activation
vector of length $h$:

$$\mathbf{h} = \phi(\mathbf{z}). \tag{5.3.2}$$

The hidden layer output $\mathbf{h}$ is also an intermediate variable. Assuming that the parameters
of the output layer possess only a weight of $\mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}$, we can obtain an output layer
variable with a vector of length $q$:

$$\mathbf{o} = \mathbf{W}^{(2)} \mathbf{h}. \tag{5.3.3}$$

Assuming that the loss function is $l$ and the example label is $y$, we can then calculate the
loss term for a single data example,

$$L = l(\mathbf{o}, y). \tag{5.3.4}$$


As we will see in the definition of $\ell_2$ regularization to be introduced later, given the
hyperparameter $\lambda$, the regularization term is

$$s = \frac{\lambda}{2} \left(\|\mathbf{W}^{(1)}\|_\textrm{F}^2 + \|\mathbf{W}^{(2)}\|_\textrm{F}^2\right), \tag{5.3.5}$$

where the Frobenius norm of the matrix is simply the $\ell_2$ norm applied after flattening the
matrix into a vector. Finally, the model’s regularized loss on a given data example is:

$$J = L + s. \tag{5.3.6}$$


We refer to J as the objective function in the following discussion. 
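The steps (5.3.1) to (5.3.6) can be traced on a tiny example in plain Python. The text leaves the activation $\phi$ and the loss $l$ generic, so this sketch assumes ReLU and a squared-error loss purely for illustration; the helper names and all numeric values are likewise made up for the example.

```python
def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def frobenius_sq(W):
    """Squared Frobenius norm: sum of squared entries."""
    return sum(w * w for row in W for w in row)


def forward(W1, W2, x, y, lam):
    z = matvec(W1, x)                      # (5.3.1)
    h = [max(0.0, a) for a in z]           # (5.3.2), assuming phi = ReLU
    o = matvec(W2, h)                      # (5.3.3)
    # (5.3.4), assuming a squared-error loss l(o, y) = 0.5 * ||o - y||^2
    L = 0.5 * sum((a - b) ** 2 for a, b in zip(o, y))
    s = 0.5 * lam * (frobenius_sq(W1) + frobenius_sq(W2))  # (5.3.5)
    return {"z": z, "h": h, "o": o, "L": L, "s": s, "J": L + s}  # (5.3.6)


W1 = [[1.0, -1.0], [0.5, 2.0]]   # h x d with h = 2, d = 2
W2 = [[2.0, -1.0]]               # q x h with q = 1
cache = forward(W1, W2, x=[1.0, 2.0], y=[-4.0], lam=0.1)
print(cache)
```

Returning every intermediate value in a dictionary mirrors the point that forward propagation stores intermediate variables, which backpropagation later reuses.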


5.3.2 Computational Graph of Forward Propagation 


Plotting computational graphs helps us visualize the dependencies of operators and vari- 
ables within the calculation. Fig. 5.3.1 contains the graph associated with the simple net- 
work described above, where squares denote variables and circles denote operators. The 
lower-left corner signifies the input and the upper-right corner is the output. Notice that 
the directions of the arrows (which illustrate data flow) are primarily rightward and up- 
ward. 


[Figure 5.3.1: Computational graph of forward propagation.]


5.3.3 Backpropagation


Backpropagation refers to the method of calculating the gradient of neural network param- 
eters. In short, the method traverses the network in reverse order, from the output to the 
input layer, according to the chain rule from calculus. The algorithm stores any interme- 
diate variables (partial derivatives) required while calculating the gradient with respect to 
some parameters. Assume that we have functions Y = f(X) and Z = g(Y), in which the 
input and the output X, Y, Z are tensors of arbitrary shapes. By using the chain rule, we can 
compute the derivative of Z with respect to X via 


$$\frac{\partial \mathsf{Z}}{\partial \mathsf{X}} = \textrm{prod}\left(\frac{\partial \mathsf{Z}}{\partial \mathsf{Y}}, \frac{\partial \mathsf{Y}}{\partial \mathsf{X}}\right). \tag{5.3.7}$$

Here we use the prod operator to multiply its arguments after the necessary operations,
such as transposition and swapping input positions, have been carried out. For vectors,
this is straightforward: it is simply matrix-matrix multiplication. For higher dimensional
tensors, we use the appropriate counterpart. The operator prod hides all the notational
overhead.


Recall that the parameters of the simple network with one hidden layer, whose computational
graph is in Fig. 5.3.1, are $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$. The objective of backpropagation is to
calculate the gradients $\partial J/\partial \mathbf{W}^{(1)}$ and $\partial J/\partial \mathbf{W}^{(2)}$. To accomplish this, we apply the chain
rule and calculate, in turn, the gradient of each intermediate variable and parameter. The
order of calculations is reversed relative to that of forward propagation, since
we need to start with the outcome of the computational graph and work our way towards
the parameters. The first step is to calculate the gradients of the objective function $J = L + s$
with respect to the loss term $L$ and the regularization term $s$:


$$\frac{\partial J}{\partial L} = 1 \; \textrm{and} \; \frac{\partial J}{\partial s} = 1. \tag{5.3.8}$$


Next, we compute the gradient of the objective function with respect to variable of the 
output layer o according to the chain rule: 


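$$\frac{\partial J}{\partial \mathbf{o}} = \textrm{prod}\left(\frac{\partial J}{\partial L}, \frac{\partial L}{\partial \mathbf{o}}\right) = \frac{\partial L}{\partial \mathbf{o}} \in \mathbb{R}^q. \tag{5.3.9}$$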


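Next, we calculate the gradients of the regularization term with respect to both parameters:

$$\frac{\partial s}{\partial \mathbf{W}^{(1)}} = \lambda \mathbf{W}^{(1)} \; \textrm{and} \; \frac{\partial s}{\partial \mathbf{W}^{(2)}} = \lambda \mathbf{W}^{(2)}. \tag{5.3.10}$$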


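Now we are able to calculate the gradient $\partial J/\partial \mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}$ of the model parameters
closest to the output layer. Using the chain rule yields:

$$\frac{\partial J}{\partial \mathbf{W}^{(2)}} = \textrm{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{W}^{(2)}}\right) + \textrm{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{W}^{(2)}}\right) = \frac{\partial J}{\partial \mathbf{o}} \mathbf{h}^\top + \lambda \mathbf{W}^{(2)}. \tag{5.3.11}$$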


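To obtain the gradient with respect to $\mathbf{W}^{(1)}$ we need to continue backpropagation along
the output layer to the hidden layer. The gradient with respect to the hidden layer output
$\partial J/\partial \mathbf{h} \in \mathbb{R}^h$ is given by

$$\frac{\partial J}{\partial \mathbf{h}} = \textrm{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{h}}\right) = {\mathbf{W}^{(2)}}^\top \frac{\partial J}{\partial \mathbf{o}}. \tag{5.3.12}$$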
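Since the activation function $\phi$ applies elementwise, calculating the gradient $\partial J/\partial \mathbf{z} \in \mathbb{R}^h$
of the intermediate variable $\mathbf{z}$ requires that we use the elementwise multiplication operator,
which we denote by $\odot$:

$$\frac{\partial J}{\partial \mathbf{z}} = \textrm{prod}\left(\frac{\partial J}{\partial \mathbf{h}}, \frac{\partial \mathbf{h}}{\partial \mathbf{z}}\right) = \frac{\partial J}{\partial \mathbf{h}} \odot \phi'\left(\mathbf{z}\right). \tag{5.3.13}$$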


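Finally, we can obtain the gradient $\partial J/\partial \mathbf{W}^{(1)} \in \mathbb{R}^{h \times d}$ of the model parameters closest to
the input layer. According to the chain rule, we get

$$\frac{\partial J}{\partial \mathbf{W}^{(1)}} = \textrm{prod}\left(\frac{\partial J}{\partial \mathbf{z}}, \frac{\partial \mathbf{z}}{\partial \mathbf{W}^{(1)}}\right) + \textrm{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{W}^{(1)}}\right) = \frac{\partial J}{\partial \mathbf{z}} \mathbf{x}^\top + \lambda \mathbf{W}^{(1)}. \tag{5.3.14}$$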
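One way to gain confidence in the backpropagation equations is to implement them by hand and compare against finite differences. The plain-Python sketch below does this for a tiny network, assuming a tanh activation and a squared-error loss (the text keeps both generic); all helper names and numeric values are illustrative.

```python
import copy
import math


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def forward(W1, W2, x, y, lam):
    z = matvec(W1, x)
    h = [math.tanh(a) for a in z]          # assumed activation: phi = tanh
    o = matvec(W2, h)
    L = 0.5 * sum((a - b) ** 2 for a, b in zip(o, y))   # assumed squared loss
    s = 0.5 * lam * sum(w * w for W in (W1, W2) for row in W for w in row)
    return L + s, h, o


def backprop(W1, W2, x, y, lam):
    _, h, o = forward(W1, W2, x, y, lam)
    dJ_do = [a - b for a, b in zip(o, y)]                          # (5.3.9)
    dJ_dW2 = [[g * hj + lam * w for hj, w in zip(h, row)]          # (5.3.11)
              for g, row in zip(dJ_do, W2)]
    dJ_dh = [sum(row[j] * g for row, g in zip(W2, dJ_do))          # (5.3.12)
             for j in range(len(h))]
    dJ_dz = [g * (1.0 - hj * hj) for g, hj in zip(dJ_dh, h)]       # (5.3.13)
    dJ_dW1 = [[g * xj + lam * w for xj, w in zip(x, row)]          # (5.3.14)
              for g, row in zip(dJ_dz, W1)]
    return dJ_dW1, dJ_dW2


def numeric_grad(params, k, i, j, x, y, lam, eps=1e-6):
    """Central difference of J with respect to entry (i, j) of params[k]."""
    plus, minus = copy.deepcopy(params), copy.deepcopy(params)
    plus[k][i][j] += eps
    minus[k][i][j] -= eps
    return (forward(*plus, x, y, lam)[0] - forward(*minus, x, y, lam)[0]) / (2 * eps)


W1 = [[0.5, -0.3, 0.8], [0.1, 0.7, -0.2]]   # h x d
W2 = [[1.0, -0.5], [0.3, 0.9]]              # q x h
x, y, lam = [1.0, 2.0, -1.0], [0.5, -1.0], 0.1

analytic = backprop(W1, W2, x, y, lam)
max_gap = max(
    abs(analytic[k][i][j] - numeric_grad([W1, W2], k, i, j, x, y, lam))
    for k, W in enumerate((W1, W2))
    for i, row in enumerate(W)
    for j, _ in enumerate(row)
)
print(max_gap)
```

Note that `backprop` reuses `h` and `o` from the forward pass, so the forward computation has to run, and its results have to be kept, before any gradient can be formed.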
5.3.4 Training Neural Networks


When training neural networks, forward and backward propagation depend on each other. 
In particular, for forward propagation, we traverse the computational graph in the direc- 
tion of dependencies and compute all the variables on its path. These are then used for 
backpropagation where the compute order on the graph is reversed. 


Take the aforementioned simple network as an illustrative example. On the one hand, com- 
puting the regularization term (5.3.5) during forward propagation depends on the current 
values of model parameters $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$. They are given by the optimization algorithm
according to backpropagation in the most recent iteration. On the other hand, the gradient 
calculation for the parameter (5.3.11) during backpropagation depends on the current value 
of the hidden layer output h, which is given by forward propagation. 


Therefore when training neural networks, once model parameters are initialized, we alter- 
nate forward propagation with backpropagation, updating model parameters using gradi- 
ents given by backpropagation. Note that backpropagation reuses the stored intermediate 
values from forward propagation to avoid duplicate calculations. One of the consequences 
is that we need to retain the intermediate values until backpropagation is complete. This is 
also one of the reasons why training requires significantly more memory than plain predic- 
tion. Besides, the size of such intermediate values is roughly proportional to the number of 
network layers and the batch size. Thus, training deeper networks using larger batch sizes 
more easily leads to out-of-memory errors. 


5.3.5 Summary 


Forward propagation sequentially calculates and stores intermediate variables within the 
computational graph defined by the neural network. It proceeds from the input to the out- 
put layer. Backpropagation sequentially calculates and stores the gradients of intermediate 
variables and parameters within the neural network in the reversed order. When training 
deep learning models, forward propagation and backpropagation are interdependent, and 
training requires significantly more memory than prediction. 


5.3.6 Exercises 


1. Assume that the inputs X to some scalar function f are $n \times m$ matrices. What is the
dimensionality of the gradient of f with respect to X? 


2. Add a bias to the hidden layer of the model described in this section (you do not need 
to include bias in the regularization term). 


1. Draw the corresponding computational graph. 
2. Derive the forward and backward propagation equations. 


3. Compute the memory footprint for training and prediction in the model described in this 
section. 


4. Assume that you want to compute second derivatives. What happens to the computational
   graph? How long do you expect the calculation to take?


5. Assume that the computational graph is too large for your GPU.
   1. Can you partition it over more than one GPU?
   2. What are the advantages and disadvantages over training on a smaller minibatch?


Discussions.


5.4 Numerical Stability and Initialization 


Thus far, every model that we have implemented required that we initialize its parameters 
according to some pre-specified distribution. Until now, we took the initialization scheme 
for granted, glossing over the details of how these choices are made. You might have even 
gotten the impression that these choices are not especially important. On the contrary, the 
choice of initialization scheme plays a significant role in neural network learning, and it 
can be crucial for maintaining numerical stability. Moreover, these choices can be tied up 
in interesting ways with the choice of the nonlinear activation function. Which function 
we choose and how we initialize parameters can determine how quickly our optimization 
algorithm converges. Poor choices here can cause us to encounter exploding or vanishing 
gradients while training. In this section, we delve into these topics in greater detail and 
discuss some heuristics that you will find useful throughout your career in deep
learning. 


%matplotlib inline 
import torch 
from d2l import torch as d2l


5.4.1 Vanishing and Exploding Gradients 


Consider a deep network with $L$ layers, input $\mathbf{x}$ and output $\mathbf{o}$. With each layer $l$ defined by
a transformation $f_l$ parametrized by weights $\mathbf{W}^{(l)}$, whose hidden layer output is $\mathbf{h}^{(l)}$ (let
$\mathbf{h}^{(0)} = \mathbf{x}$), our network can be expressed as:

$$\mathbf{h}^{(l)} = f_l(\mathbf{h}^{(l-1)}) \textrm{ and thus } \mathbf{o} = f_L \circ \cdots \circ f_1(\mathbf{x}). \tag{5.4.1}$$

If all the hidden layer outputs and the input are vectors, we can write the gradient of $\mathbf{o}$ with
respect to any set of parameters $\mathbf{W}^{(l)}$ as follows:

$$\partial_{\mathbf{W}^{(l)}} \mathbf{o} = \underbrace{\partial_{\mathbf{h}^{(L-1)}} \mathbf{h}^{(L)}}_{\mathbf{M}^{(L)} \stackrel{\textrm{def}}{=}} \cdots \underbrace{\partial_{\mathbf{h}^{(l)}} \mathbf{h}^{(l+1)}}_{\mathbf{M}^{(l+1)} \stackrel{\textrm{def}}{=}} \underbrace{\partial_{\mathbf{W}^{(l)}} \mathbf{h}^{(l)}}_{\mathbf{v}^{(l)} \stackrel{\textrm{def}}{=}}. \tag{5.4.2}$$

In other words, this gradient is the product of $L - l$ matrices $\mathbf{M}^{(L)} \cdots \mathbf{M}^{(l+1)}$ and the
gradient vector $\mathbf{v}^{(l)}$. Thus we are susceptible to the same problems of numerical underflow
that often crop up when multiplying together too many probabilities. When dealing with
probabilities, a common trick is to switch into log-space, i.e., shifting pressure from the
mantissa to the exponent of the numerical representation. Unfortunately, our problem above
is more serious: initially the matrices $\mathbf{M}^{(l)}$ may have a wide variety of eigenvalues. They
might be small or large, and their product might be very large or very small.


The risks posed by unstable gradients go beyond numerical representation. Gradients of 
unpredictable magnitude also threaten the stability of our optimization algorithms. We may 
be facing parameter updates that are either (i) excessively large, destroying our model (the 
exploding gradient problem); or (ii) excessively small (the vanishing gradient problem), 
rendering learning impossible as parameters hardly move on each update. 


Vanishing Gradients 


One frequent culprit causing the vanishing gradient problem is the choice of the activation 
function $\sigma$ that is appended following each layer’s linear operations. Historically, the sigmoid
function $1/(1 + \exp(-x))$ (introduced in Section 5.1) was popular because it resembles
a thresholding function. Since early artificial neural networks were inspired by biological 
neural networks, the idea of neurons that fire either fully or not at all (like biological neu- 
rons) seemed appealing. Let’s take a closer look at the sigmoid to see why it can cause 
vanishing gradients. 


x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.sigmoid(x)
y.backward(torch.ones_like(x))

d2l.plot(x.detach().numpy(), [y.detach().numpy(), x.grad.numpy()],
         legend=['sigmoid', 'gradient'], figsize=(4.5, 2.5))


[Figure: the sigmoid function and its gradient plotted against x.]

As you can see, the sigmoid’s gradient vanishes both when its inputs are large and positive and when
they are large and negative. Moreover, when backpropagating through many layers, unless we are in the
Goldilocks zone, where the inputs to many of the sigmoids are close to zero, the gradients 
of the overall product may vanish. When our network boasts many layers, unless we are 
careful, the gradient will likely be cut off at some layer. Indeed, this problem used to plague 
deep network training. Consequently, ReLUs, which are more stable (but less neurally 
plausible), have emerged as the default choice for practitioners. 
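The compounding effect is easy to see in a deliberately simplified setting. The plain-Python sketch below applies the sigmoid repeatedly to a scalar, with every weight fixed to 1 (an assumption made only for illustration), and accumulates the derivative through the chain rule. Since each factor is at most 0.25, the gradient after `depth` layers is bounded by `0.25 ** depth`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def chain_gradient(x, depth):
    """d/dx of sigmoid applied `depth` times, accumulated by the chain rule."""
    grad = 1.0
    for _ in range(depth):
        x = sigmoid(x)
        grad *= x * (1.0 - x)   # derivative of this layer at its input
    return grad


# Columns: depth, actual gradient, upper bound 0.25 ** depth.
for depth in (1, 5, 10, 20):
    print(depth, chain_gradient(0.0, depth), 0.25 ** depth)
```

In a real network the weight matrices also enter the product, so this bound describes only the contribution of the activation function.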
Exploding Gradients


The opposite problem, when gradients explode, can be similarly vexing. To illustrate this 
a bit better, we draw 100 Gaussian random matrices and multiply them with some initial 
matrix. For the scale that we picked (the choice of the variance $\sigma^2 = 1$), the matrix product
explodes. When this happens because of the initialization of a deep network, we have no 
chance of getting a gradient descent optimizer to converge. 


M = torch.normal(0, 1, size=(4, 4))
print('a single matrix \n', M)
for i in range(100):
    M = M @ torch.normal(0, 1, size=(4, 4))
print('after multiplying 100 matrices\n', M)


a single matrix
 tensor([[-0.8755, -1.2171,  1.3316,  0.1357],
        [ 0.4399,  1.4073, -1.9131, -0.4608],
        [-2.1420,  0.3643, -0.5267,  1.0277],
        [-0.1734, -0.7549,  2.3024,  1.3085]])
after multiplying 100 matrices
 tensor([[-2.9185e+23,  1.3915e+25, -1.1865e+25,  1.4354e+24],
        [ 4.9142e+23, -2.3430e+25,  1.9979e+25, -2.4169e+24],
        [ 2.6578e+23, -1.2672e+25,  1.0805e+25, -1.3072e+24],
        [-5.2223e+23,  2.4899e+25, -2.1231e+25,  2.5684e+24]])
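How strongly the product grows depends on the scale of the entries. As a rough, framework-free sketch, the code below repeats the experiment in plain Python for several standard deviations and reports the base-10 logarithm of the Frobenius norm of the final product. The function name, the seed, and the chosen scales are assumptions for illustration; the same random draws are reused for every scale so that only the scale differs.

```python
import math
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def log10_norm_after_products(sigma, steps=100, n=4, seed=0):
    """log10 of the Frobenius norm after multiplying steps + 1 random matrices."""
    rng = random.Random(seed)   # fixed seed: every sigma sees the same draws

    def draw():
        return [[rng.gauss(0.0, 1.0) * sigma for _ in range(n)] for _ in range(n)]

    M = draw()
    for _ in range(steps):
        M = matmul(M, draw())
    return math.log10(math.sqrt(sum(v * v for row in M for v in row)))


for sigma in (0.25, 0.5, 1.0):
    print(sigma, log10_norm_after_products(sigma))
```

Halving the scale divides the final norm by a factor of 2 for each of the 101 matrices, so a modest change in the initialization scale shifts the result by dozens of orders of magnitude.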


Breaking the Symmetry 


Another problem in neural network design is the symmetry inherent in their parametriza- 
tion. Assume that we have a simple MLP with one hidden layer and two units. In this case, 
we could permute the weights $\mathbf{W}^{(1)}$ of the first layer and likewise permute the weights of
the output layer to obtain the same function. There is nothing special differentiating the 
first and second hidden units. In other words, we have permutation symmetry among the 
hidden units of each layer. 


This is more than just a theoretical nuisance. Consider the aforementioned one-hidden-layer
MLP with two hidden units. For illustration, suppose that the output layer transforms
the two hidden units into only one output unit. Imagine what would happen if we initialized
all the parameters of the hidden layer as $\mathbf{W}^{(1)} = c$ for some constant $c$. In this case, during
forward propagation either hidden unit takes the same inputs and parameters, producing the
same activation, which is fed to the output unit. During backpropagation, differentiating the
output unit with respect to parameters $\mathbf{W}^{(1)}$ gives a gradient all of whose elements take
the same value. Thus, after gradient-based iteration (e.g., minibatch stochastic gradient descent),
all the elements of $\mathbf{W}^{(1)}$ still take the same value. Such iterations would never break
the symmetry on their own and we might never be able to realize the network’s expressive
power. The hidden layer would behave as if it had only a single unit. Note that while minibatch
stochastic gradient descent would not break this symmetry, dropout regularization
(to be introduced later) would!
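The symmetry argument can be reproduced with a few lines of plain Python. The sketch below trains a network with two hidden units and one output on a single made-up example, assuming a tanh activation, a squared loss, no biases, and (beyond what the text requires) constant output weights as well. With constant initialization the two rows of the hidden weight matrix stay identical; with distinct initial values they do not.

```python
import math


def train_step(W1, w2, x, y, lr):
    """One gradient step for a 2-hidden-unit MLP with a scalar output.

    Assumptions for illustration: tanh activation, squared loss, no biases.
    """
    h = [math.tanh(sum(w * a for w, a in zip(row, x))) for row in W1]
    o = sum(w * hj for w, hj in zip(w2, h))
    d_o = o - y
    d_w2 = [d_o * hj for hj in h]
    d_z = [d_o * w * (1.0 - hj * hj) for w, hj in zip(w2, h)]
    new_W1 = [[w - lr * dz * a for w, a in zip(row, x)] for row, dz in zip(W1, d_z)]
    new_w2 = [w - lr * g for w, g in zip(w2, d_w2)]
    return new_W1, new_w2


def train(W1, w2, steps=50):
    for _ in range(steps):
        W1, w2 = train_step(W1, w2, x=[1.0, -2.0, 0.5], y=1.0, lr=0.1)
    return W1, w2


c = 0.3
const_W1, _ = train([[c] * 3, [c] * 3], [c, c])
mixed_W1, _ = train([[0.3, -0.1, 0.2], [-0.4, 0.25, 0.1]], [0.3, -0.2])
# First flag: rows still identical after constant init; second: after distinct init.
print(const_W1[0] == const_W1[1], mixed_W1[0] == mixed_W1[1])
```

The two hidden units in the constant case receive exactly the same updates at every step, which is why random initialization is the usual way to break the tie.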
5.4.2 Parameter Initialization


One way of addressing—or at least mitigating—the issues raised above is through care- 
ful initialization. As we will see later, additional care during optimization and suitable 
regularization can further enhance stability. 


Default Initialization 


In the previous sections, e.g., in Section 3.5, we used a normal distribution to initialize the 
values of our weights. If we do not specify the initialization method, the framework will 
use a default random initialization method, which often works well in practice for moderate 
problem sizes. 


Xavier Initialization 


Let’s look at the scale distribution of an output $o_i$ for some fully connected layer without
nonlinearities. With $n_\textrm{in}$ inputs $x_j$ and their associated weights $w_{ij}$ for this layer, an output
is given by

$$o_i = \sum_{j=1}^{n_\textrm{in}} w_{ij} x_j. \tag{5.4.3}$$

The weights $w_{ij}$ are all drawn independently from the same distribution. Furthermore, let’s
assume that this distribution has zero mean and variance $\sigma^2$. Note that this does not mean
that the distribution has to be Gaussian, just that the mean and variance need to exist. For
now, let’s assume that the inputs to the layer $x_j$ also have zero mean and variance $\gamma^2$ and that
they are independent of $w_{ij}$ and independent of each other. In this case, we can compute
the mean of $o_i$:


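$$E[o_i] = \sum_{j=1}^{n_\textrm{in}} E[w_{ij} x_j] = \sum_{j=1}^{n_\textrm{in}} E[w_{ij}] E[x_j] = 0, \tag{5.4.4}$$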


and the variance: 


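$$\begin{aligned} \textrm{Var}[o_i] &= E[o_i^2] - (E[o_i])^2 \\ &= \sum_{j=1}^{n_\textrm{in}} E[w_{ij}^2 x_j^2] - 0 \\ &= \sum_{j=1}^{n_\textrm{in}} E[w_{ij}^2] E[x_j^2] \\ &= n_\textrm{in} \sigma^2 \gamma^2. \end{aligned} \tag{5.4.5}$$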

One way to keep the variance fixed is to set $n_\textrm{in} \sigma^2 = 1$. Now consider backpropagation.
There we face a similar problem, albeit with gradients being propagated from the layers
closer to the output. Using the same reasoning as for forward propagation, we see that the
gradients’ variance can blow up unless $n_\textrm{out} \sigma^2 = 1$, where $n_\textrm{out}$ is the number of outputs
of this layer. This leaves us in a dilemma: we cannot possibly satisfy both conditions
simultaneously. Instead, we simply try to satisfy:

$$\frac{1}{2} (n_\textrm{in} + n_\textrm{out}) \sigma^2 = 1 \textrm{ or equivalently } \sigma = \sqrt{\frac{2}{n_\textrm{in} + n_\textrm{out}}}. \tag{5.4.6}$$


This is the reasoning underlying the now-standard and practically beneficial Xavier initial- 
ization, named after the first author of its creators (Glorot and Bengio, 2010). Typically, the 
Xavier initialization samples weights from a Gaussian distribution with zero mean and vari- 


ance o? = —*—. We can also adapt this to choose the variance when sampling weights 
NintMNout 


from a uniform distribution. Note that the uniform distribution U(—a, a) has variance ©. 


à a: koe 2 See s 
Plugging 4% into our condition on o~ prompts us to initialize according to 


6 6 
iaae; (5.4.7) 
Nin + Nout Nin + Nout 


Though the assumption that there are no nonlinearities in the above mathematical reasoning
is easily violated in neural networks, the Xavier initialization method turns out
to work well in practice.
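As a small sanity check of the uniform variant, the sketch below computes the half-width of the sampling interval and confirms that the sampled weights have variance close to $2 / (n_\mathrm{in} + n_\mathrm{out})$. It is an illustration written for this passage, not the framework's implementation: the helper names `xavier_uniform_bound` and `sample_variance` and the layer sizes are our own assumptions, and only the Python standard library is used.

```python
import math
import random

def xavier_uniform_bound(n_in, n_out):
    """Half-width a of U(-a, a) such that a^2 / 3 equals 2 / (n_in + n_out)."""
    return math.sqrt(6.0 / (n_in + n_out))

def sample_variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

random.seed(0)
n_in, n_out = 256, 128  # illustrative layer sizes (assumed, not from the text)

a = xavier_uniform_bound(n_in, n_out)
# Draw one weight per (input, output) pair, as for a dense layer.
weights = [random.uniform(-a, a) for _ in range(n_in * n_out)]

target = 2.0 / (n_in + n_out)
empirical = sample_variance(weights)

print(f"bound a            = {a:.4f}")
print(f"target variance    = {target:.6f}")
print(f"empirical variance = {empirical:.6f}")
```

For these sizes the bound is exactly 0.125, since $6 / 384 = 0.015625$. In PyTorch, the functions `torch.nn.init.xavier_uniform_` and `torch.nn.init.xavier_normal_` provide this scheme for real tensors.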


Beyond 


The reasoning above barely scratches the surface of modern approaches to parameter initialization.
A deep learning framework often implements over a dozen different heuristics, among them
heuristics specialized for tied (shared) parameters, super-resolution, sequence models, and other
situations. Moreover, parameter initialization continues to be a hot area of fundamental research in
deep learning. For instance, Xiao et al. (2018) demonstrated the possibility of training 10,000-layer
neural networks without architectural tricks by using a carefully designed initialization method.


If the topic interests you, we suggest a deep dive into this module’s offerings, reading the
papers that proposed and analyzed each heuristic, and then exploring the latest publications 
on the topic. Perhaps you will stumble across or even invent a clever idea and contribute 
an implementation to deep learning frameworks. 


5.4.3 Summary 


Vanishing and exploding gradients are common issues in deep networks. Great care in 
parameter initialization is required to ensure that gradients and parameters remain well 
controlled. Initialization heuristics are needed to ensure that the initial gradients are neither 
too large nor too small. Random initialization is key to ensuring that symmetry is broken 
before optimization. Xavier initialization suggests that, for each layer, variance of any 
output is not affected by the number of inputs, and variance of any gradient is not affected by 
the number of outputs. ReLU activation functions mitigate the vanishing gradient problem. 
This can accelerate convergence. 


5.4.4 Exercises


1. Can you design other cases where a neural network might exhibit symmetry that needs 
breaking, besides the permutation symmetry in an MLP’s layers? 


2. Can we initialize all weight parameters in linear regression or in softmax regression to 
the same value? 


3. Look up analytic bounds on the eigenvalues of the product of two matrices. What does 
this tell you about ensuring that gradients are well conditioned? 


4. If we know that some terms diverge, can we fix this after the fact? Look at the paper on 
layerwise adaptive rate scaling for inspiration (You et al., 2017). 




5.5 Generalization in Deep Learning 


In Chapter 3 and Chapter 4, we tackled regression and classification problems by fitting 
linear models to training data. In both cases, we provided practical algorithms for finding 
the parameters that maximized the likelihood of the observed training labels. And then, 
towards the end of each chapter, we recalled that fitting the training data was only an in- 
termediate goal. Our real quest all along was to discover general patterns on the basis 
of which we can make accurate predictions even on new examples drawn from the same 
underlying population. Machine learning researchers are consumers of optimization algo- 
rithms. Sometimes, we must even develop new optimization algorithms. But at the end 
of the day, optimization is merely a means to an end. At its core, machine learning is a 
statistical discipline and we wish to optimize training loss only insofar as some statistical 
principle (known or unknown) leads the resulting models to generalize beyond the training 
set. 


On the bright side, it turns out that deep neural networks trained by stochastic gradient de- 
scent generalize remarkably well across myriad prediction problems, spanning computer 
vision; natural language processing; time series data; recommender systems; electronic 
health records; protein folding; value function approximation in video games and board 
games; and numerous other domains. On the downside, if you were looking for a straight- 
forward account of either the optimization story (why we can fit them to training data) or 
the generalization story (why the resulting models generalize to unseen examples), then you 
might want to pour yourself a drink. While our procedures for optimizing linear models 
and the statistical properties of the solutions are both described well by a comprehensive 
body of theory, our understanding of deep learning still resembles the wild west on both 
fronts. 


Both the theory and practice of deep learning are rapidly evolving, with theorists adopting 
new strategies to explain what’s going on, even as practitioners continue to innovate at
a blistering pace, building arsenals of heuristics for training deep networks and a body of
intuitions and folk knowledge that provide guidance for deciding which techniques to apply 
in which situations. 


The summary of the present moment is that the theory of deep learning has produced 
promising lines of attack and scattered fascinating results, but still appears far from a com- 
prehensive account of both (i) why we are able to optimize neural networks and (ii) how 
models learned by gradient descent manage to generalize so well, even on high-dimensional 
tasks. However, in practice, (i) is seldom a problem (we can always find parameters that will
fit all of our training data) and thus understanding generalization is by far the bigger problem.
On the other hand, even absent the comfort of a coherent scientific theory, practitioners 
have developed a large collection of techniques that may help you to produce models that 
generalize well in practice. While no pithy summary can possibly do justice to the vast 
topic of generalization in deep learning, and while the overall state of research is far from 
resolved, we hope, in this section, to present a broad overview of the state of research and 
practice. 


5.5.1 Revisiting Overfitting and Regularization 


According to the “no free lunch” theorem of Wolpert and Macready (1995), any learn- 
ing algorithm generalizes better on data with certain distributions, and worse with other 
distributions. Thus, given a finite training set, a model relies on certain assumptions: to 
achieve human-level performance it may be useful to identify inductive biases that reflect 
how humans think about the world. Such inductive biases show preferences for solutions 
with certain properties. For example, a deep MLP has an inductive bias towards building 
up a complicated function by the composition of simpler functions. 


With machine learning models encoding inductive biases, our approach to training them 
typically consists of two phases: (i) fit the training data; and (ii) estimate the generalization 
error (the true error on the underlying population) by evaluating the model on holdout data. 
The difference between our fit on the training data and our fit on the test data is called the 
generalization gap and when this is large, we say that our models overfit to the training data. 
In extreme cases of overfitting, we might exactly fit the training data, even when the test 
error remains significant. And in the classical view, the interpretation is that our models are 
too complex, requiring that we either shrink the number of features, the number of nonzero 
parameters learned, or the size of the parameters as quantified by a norm. Recall the plot of model
complexity compared with loss (Fig. 3.6.1) from Section 3.6. 


However, deep learning complicates this picture in counterintuitive ways. First, for classification
problems, our models are typically expressive enough to perfectly fit every training
example, even in datasets consisting of millions (Zhang et al., 2021). In the classical picture,
we might think that this setting lies on the far right extreme of the model complexity
axis, and that any improvements in generalization error must come by way of regulariza- 
tion, either by reducing the complexity of the model class, or by applying a penalty, severely 
constraining the set of values that our parameters might take. But that is where things start 
to get weird.

Strangely, for many deep learning tasks (e.g., image recognition and text classification) we
are typically choosing among model architectures, all of which can achieve arbitrarily low 
training loss (and zero training error). Because all models under consideration achieve 
zero training error, the only avenue for further gains is to reduce overfitting. Even stranger, 
it is often the case that despite fitting the training data perfectly, we can actually reduce 
the generalization error further by making the model even more expressive, e.g., adding 
layers, nodes, or training for a larger number of epochs. Stranger yet, the pattern relating 
the generalization gap to the complexity of the model (as captured, for example, in the depth 
or width of the networks) can be non-monotonic, with greater complexity hurting at first 
but subsequently helping in a so-called “double-descent” pattern (Nakkiran et al., 2021). 
Thus the deep learning practitioner possesses a bag of tricks, some of which seemingly 
restrict the model in some fashion and others that seemingly make it even more expressive, 
and all of which, in some sense, are applied to mitigate overfitting. 


Complicating things even further, while the guarantees provided by classical learning the- 
ory can be conservative even for classical models, they appear powerless to explain why 
it is that deep neural networks generalize in the first place. Because deep neural networks 
are capable of fitting arbitrary labels even for large datasets, and despite the use of familiar
methods such as $\ell_2$ regularization, traditional complexity-based generalization bounds,
e.g., those based on the VC dimension or Rademacher complexity of a hypothesis class,
cannot explain why neural networks generalize.


5.5.2 Inspiration from Nonparametrics 


Approaching deep neural networks for the first time, it is tempting to think of them as parametric
models. After all, the models do have millions of parameters. When we update the models, 
we update their parameters. When we save the models, we write their parameters to disk. 
However, mathematics and computer science are riddled with counterintuitive changes of 
perspective, and surprising isomorphisms between seemingly different problems. While 
neural networks clearly have parameters, in some ways it can be more fruitful to think of 
them as behaving like nonparametric models. So what precisely makes a model nonpara- 
metric? While the name covers a diverse set of approaches, one common theme is that 
nonparametric methods tend to have a level of complexity that grows as the amount of 
available data grows. 


Perhaps the simplest example of a nonparametric model is the k-nearest neighbor algorithm 
(we will cover more nonparametric models later, for example in Section 11.2). Here, at 
training time, the learner simply memorizes the dataset. Then, at prediction time, when 
confronted with a new point $\mathbf{x}$, the learner looks up the $k$ nearest neighbors (the $k$ points
$\mathbf{x}_i$ that minimize some distance $d(\mathbf{x}, \mathbf{x}_i)$). When $k = 1$, this algorithm is called 1-nearest
neighbors, and the algorithm will always achieve a training error of zero. That, however,
does not mean that the algorithm will not generalize. In fact, it turns out that under some
mild conditions, the 1-nearest neighbor algorithm is consistent (eventually converging to 
the optimal predictor). 


Note that 1-nearest neighbor requires that we specify some distance function $d$, or equivalently,
that we specify some vector-valued basis function $\phi(\mathbf{x})$ for featurizing our data. For
any choice of the distance metric, we will achieve zero training error and eventually reach
an optimal predictor, but different distance metrics d encode different inductive biases and 
with a finite amount of available data will yield different predictors. Different choices of 
the distance metric d represent different assumptions about the underlying patterns and the 
performance of the different predictors will depend on how compatible the assumptions are 
with the observed data. 
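To make the role of the distance function concrete, here is a minimal sketch of 1-nearest neighbor with two different choices of $d$. The toy dataset, the query point, and the function names are made up for illustration, and the second distance (strictly speaking a pseudometric, since it ignores one feature) is chosen only to encode a visibly different inductive bias.

```python
import math

def one_nearest_neighbor(train_points, train_labels, query, distance):
    """Return the label of the training point closest to `query` under `distance`."""
    best = min(range(len(train_points)),
               key=lambda i: distance(query, train_points[i]))
    return train_labels[best]

def euclidean(u, v):
    """Standard Euclidean distance: all features matter equally."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def first_coordinate_only(u, v):
    """A different inductive bias: only the first feature is considered."""
    return abs(u[0] - v[0])

# Toy dataset (invented for this example).
X = [(0.0, 0.0), (1.0, 5.0), (4.0, 0.5)]
y = ["a", "b", "c"]
query = (1.2, 0.0)

for dist in (euclidean, first_coordinate_only):
    train_preds = [one_nearest_neighbor(X, y, x, dist) for x in X]
    train_error = sum(p != t for p, t in zip(train_preds, y)) / len(y)
    prediction = one_nearest_neighbor(X, y, query, dist)
    print(f"{dist.__name__}: train error = {train_error}, prediction = {prediction}")
```

Both distance functions memorize the training set perfectly, yet they disagree on the query: the Euclidean distance picks the label `"a"`, while the first-coordinate distance picks `"b"`. Which answer is better depends entirely on whether the assumption baked into the distance matches the data.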


In a sense, because neural networks are over-parametrized, possessing many more parame- 
ters than are needed to fit the training data, they tend to interpolate the training data (fitting 
it perfectly) and thus behave, in some ways, more like nonparametric models. More recent
theoretical research has established deep connections between large neural networks
and nonparametric methods, notably kernel methods. In particular, Jacot et al. (2018) 
demonstrated that in the limit, as multilayer perceptrons with randomly initialized weights 
grow infinitely wide, they become equivalent to (nonparametric) kernel methods for a spe- 
cific choice of the kernel function (essentially, a distance function), which they call the 
neural tangent kernel. While current neural tangent kernel models may not fully explain 
the behavior of modern deep networks, their success as an analytical tool underscores the 
usefulness of nonparametric modeling for understanding the behavior of over-parametrized 
deep networks. 


5.5.3 Early Stopping 


While deep neural networks are capable of fitting arbitrary labels, even when labels are 
assigned incorrectly or randomly (Zhang et al., 2021), this capability only emerges over 
many iterations of training. A new line of work (Rolnick et al., 2017) has revealed that 
in the setting of label noise, neural networks tend to fit cleanly labeled data first and only 
subsequently to interpolate the mislabeled data. Moreover, it has been established that this 
phenomenon translates directly into a guarantee on generalization: whenever a model has 
fitted the cleanly labeled data but not randomly labeled examples included in the training 
set, it has in fact generalized (Garg et al., 2021). 


Together these findings help to motivate early stopping, a classic technique for regularizing 
deep neural networks. Here, rather than directly constraining the values of the weights, one 
constrains the number of epochs of training. The most common way to determine the 
stopping criterion is to monitor validation error throughout training (typically by checking 
once after each epoch) and to cut off training when the validation error has not decreased 
by more than some small amount $\epsilon$ for some number of epochs. This is sometimes called a
patience criterion. As well as the potential to lead to better generalization in the setting of 
noisy labels, another benefit of early stopping is the time saved. Once the patience criterion 
is met, one can terminate training. For large models that might require days of training 
simultaneously across eight or more GPUs, well-tuned early stopping can save researchers 
days of time and can save their employers many thousands of dollars. 
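The patience criterion described above can be sketched in a few lines. This is one plausible reading of the rule rather than a reference implementation: the function name `early_stopping_epoch`, the parameter names `patience` and `min_delta` (playing the role of $\epsilon$), and the validation errors are all assumptions made for illustration.

```python
def early_stopping_epoch(val_errors, patience=3, min_delta=1e-3):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation error has failed to improve on the
    best value seen so far by more than `min_delta` for `patience`
    consecutive epochs. If that never happens, the last epoch is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, err in enumerate(val_errors, start=1):
        if err < best - min_delta:
            # Meaningful improvement: remember it and reset the counter.
            best = err
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_errors)

# Made-up validation errors, one per epoch: progress stalls after epoch 4.
val_errors = [0.50, 0.40, 0.35, 0.33, 0.335, 0.331, 0.332, 0.30]
stop = early_stopping_epoch(val_errors, patience=3, min_delta=0.005)
print(stop)  # 7
```

In this example training halts after epoch 7, so the later improvement at epoch 8 is never seen. That is the usual trade-off: a larger `patience` wastes more compute but is less likely to stop just before a late improvement. In a real training loop one would typically also keep a checkpoint of the parameters from the best epoch.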


Notably, when there is no label noise and datasets are realizable (the classes are truly sep- 
arable, e.g., distinguishing cats from dogs), early stopping tends not to lead to significant 
improvements in generalization. On the other hand, when there is label noise, or intrinsic
variability in the label (e.g., predicting mortality among patients), early stopping is crucial.
Training models until they interpolate noisy data is typically a bad idea. 


5.5.4 Classical Regularization Methods for Deep Networks 


In Chapter 3, we described several classical regularization techniques for constraining the 
complexity of our models. In particular, Section 3.7 introduced a method called weight 
decay, which consists of adding a regularization term to the loss function in order to penalize 
large values of the weights. Depending on which weight norm is penalized this technique 
is known either as ridge regularization (for an $\ell_2$ penalty) or lasso regularization (for an $\ell_1$
penalty). In the classical analysis of these regularizers, they are considered as sufficiently 
restrictive on the values that the weights can take to prevent the model from fitting arbitrary 
labels. 


In deep learning implementations, weight decay remains a popular tool. However, researchers
have noted that typical strengths of $\ell_2$ regularization are insufficient to prevent the
networks from interpolating the data (Zhang et al., 2021) and thus the benefits if interpreted 
as regularization might only make sense in combination with the early stopping criterion. 
Absent early stopping, it is possible that just like the number of layers or number of nodes 
(in deep learning) or the distance metric (in 1-nearest neighbor), these methods may lead to 
better generalization not because they meaningfully constrain the power of the neural net- 
work but rather because they somehow encode inductive biases that are better compatible 
with the patterns found in datasets of interest. Thus, classical regularizers remain popular
in deep learning implementations, even if the theoretical rationale for their efficacy may be 
radically different. 


Notably, deep learning researchers have also built on techniques first popularized in classi- 
cal regularization contexts, such as adding noise to model inputs. In the next section we will 
introduce the famous dropout technique (invented by Srivastava et al. (2014)), which has 
become a mainstay of deep learning, even as the theoretical basis for its efficacy remains 
similarly mysterious. 


5.5.5 Summary 


Unlike classical linear models, which tend to have fewer parameters than examples, deep 
networks tend to be over-parametrized, and for most tasks are capable of perfectly fitting 
the training set. This interpolation regime challenges many firmly held intuitions. Functionally,
neural networks look like parametric models. But thinking of them as nonparametric
models can sometimes be a more reliable source of intuition. Because it is often
the case that all deep networks under consideration are capable of fitting all of the train- 
ing labels, nearly all gains must come by mitigating overfitting (closing the generalization 
gap). Paradoxically, the interventions that reduce the generalization gap sometimes appear 
to increase model complexity and at other times appear to decrease complexity. However, 
these methods seldom decrease complexity sufficiently for classical theory to explain the 
generalization of deep networks, and why certain choices lead to improved generalization 
remains for the most part a massive open question despite the concerted efforts of many 
brilliant researchers.

5.5.6 Exercises


1. In what sense do traditional complexity-based measures fail to account for generalization 
of deep neural networks? 


2. Why might early stopping be considered a regularization technique? 
3. How do researchers typically determine the stopping criterion? 


4. What important factor seems to differentiate cases when early stopping leads to big 
improvements in generalization? 


5. Beyond generalization, describe another benefit of early stopping. 




5.6 Dropout 


Let’s think briefly about what we expect from a good predictive model. We want it to perform
well on unseen data. Classical generalization theory suggests that to close the gap
between train and test performance, we should aim for a simple model. Simplicity can 
come in the form of a small number of dimensions. We explored this when discussing the 
monomial basis functions of linear models in Section 3.6. Additionally, as we saw when 
discussing weight decay ($\ell_2$ regularization) in Section 3.7, the (inverse) norm of the parameters
also represents a useful measure of simplicity. Another useful notion of simplicity
is smoothness, i.e., that the function should not be sensitive to small changes to its inputs. 
For instance, when we classify images, we would expect that adding some random noise to 
the pixels should be mostly harmless. 


Bishop (1995) formalized this idea when he proved that training with input noise is equiva- 
lent to Tikhonov regularization. This work drew a clear mathematical connection between 
the requirement that a function be smooth (and thus simple), and the requirement that it be 
resilient to perturbations in the input. 


Then, Srivastava et al. (2014) developed a clever idea for how to apply Bishop’s idea to the 
internal layers of a network, too. Their idea, called dropout, involves injecting noise while 
computing each internal layer during forward propagation, and it has become a standard 
technique for training neural networks. The method is called dropout because we literally 
drop out some neurons during training. Throughout training, on each iteration, standard 
dropout consists of zeroing out some fraction of the nodes in each layer before calculating 
the subsequent layer. 


To be clear, we are imposing our own narrative with the link to Bishop. The original pa- 
per on dropout offers intuition through a surprising analogy to sexual reproduction. The 
authors argue that neural network overfitting is characterized by a state in which each layer
relies on a specific pattern of activations in the previous layer, calling this condition co-adaptation.
Dropout, they claim, breaks up co-adaptation just as sexual reproduction is
argued to break up co-adapted genes. While the justification of this theory is certainly
up for debate, the dropout technique itself has proved enduring, and various forms of
dropout are implemented in most deep learning libraries. 


The key challenge is how to inject this noise. One idea is to inject it in an unbiased manner 
so that the expected value of each layer—while fixing the others—equals the value it would 
have taken absent noise. In Bishop’s work, he added Gaussian noise to the inputs to a linear 
model. At each training iteration, he added noise sampled from a distribution with mean
zero $\epsilon \sim \mathcal{N}(0, \sigma^2)$ to the input $\mathbf{x}$, yielding a perturbed point $\mathbf{x}' = \mathbf{x} + \epsilon$. In expectation,
$E[\mathbf{x}'] = \mathbf{x}$.


In standard dropout regularization, one zeros out some fraction of the nodes in each layer 
and then debiases each layer by normalizing by the fraction of nodes that were retained (not 
dropped out). In other words, with dropout probability p, each intermediate activation h is 
replaced by a random variable h’ as follows: 


$$h' = \begin{cases} 0 & \text{with probability } p \\ \dfrac{h}{1-p} & \text{otherwise} \end{cases} \qquad (5.6.1)$$


By design, the expectation remains unchanged, i.e., E[h’] = h. 
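To see why the rescaling removes the bias, one can expand the expectation over the two outcomes of (5.6.1), assuming $p < 1$. The variance in the second line is not stated in the text; it is added here as a short worked consequence of the same definition, and it shows that the injected noise grows with the dropout probability.

```latex
\begin{aligned}
E[h'] &= p \cdot 0 + (1 - p) \cdot \frac{h}{1 - p} = h, \\
\mathrm{Var}[h'] &= E[h'^2] - (E[h'])^2
  = (1 - p) \cdot \frac{h^2}{(1 - p)^2} - h^2
  = \frac{p}{1 - p}\, h^2 .
\end{aligned}
```

For example, with $p = 0.5$ the variance equals $h^2$, whereas with $p = 0.1$ it is only $h^2 / 9$.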


import torch 
from torch import nn 
from d2l import torch as d2l


5.6.1 Dropout in Practice 


Recall the MLP with a hidden layer and five hidden units from Fig. 5.1.1. When we apply 
dropout to a hidden layer, zeroing out each hidden unit with probability p, the result can 
be viewed as a network containing only a subset of the original neurons. In Fig. 5.6.1, $h_2$
and $h_5$ are removed. Consequently, the calculation of the outputs no longer depends on $h_2$
or $h_5$ and their respective gradient also vanishes when performing backpropagation. In this
way, the calculation of the output layer cannot be overly dependent on any one element of
$h_1, \ldots, h_5$.


Fig. 5.6.1: MLP before and after dropout.

Typically, we disable dropout at test time. Given a trained model and a new example,
we do not drop out any nodes and thus do not need to normalize. However, there are 
some exceptions: some researchers use dropout at test time as a heuristic for estimating the 
uncertainty of neural network predictions: if the predictions agree across many different 
dropout outputs, then we might say that the network is more confident. 


5.6.2 Implementation from Scratch 


To implement the dropout function for a single layer, we must draw as many samples from a 
Bernoulli (binary) random variable as our layer has dimensions, where the random variable 
takes value 1 (keep) with probability $1-p$ and 0 (drop) with probability $p$. One easy way
to implement this is to first draw samples from the uniform distribution $U[0, 1]$. Then we
can keep those nodes for which the corresponding sample is greater than p, dropping the 
rest. 


In the following code, we implement a dropout_layer function that drops out the elements 
in the tensor input X with probability dropout, rescaling the remainder as described above: 
dividing the survivors by 1.0-dropout. 


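def dropout_layer(X, dropout):
    assert 0 <= dropout <= 1
    # With dropout == 1 every element is dropped; return zeros to avoid dividing by zero
    if dropout == 1: return torch.zeros_like(X)
    # Keep each element independently with probability 1 - dropout
    mask = (torch.rand(X.shape) > dropout).float()
    # Rescale the survivors so that the expected value is unchanged
    return mask * X / (1.0 - dropout)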


We can test out the dropout_layer function on a few examples. In the following lines of 
code, we pass our input X through the dropout operation, with probabilities 0, 0.5, and 1, 
respectively. 


X = torch.arange(16, dtype = torch.float32).reshape((2, 8))
print('dropout_p = 0:', dropout_layer(X, 0))
print('dropout_p = 0.5:', dropout_layer(X, 0.5))
print('dropout_p = 1:', dropout_layer(X, 1))


dropout_p = 0: tensor([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11., 12., 13., 14., 15.]])
dropout_p = 0.5: tensor([[ 0.,  2.,  0.,  6.,  8.,  0.,  0.,  0.],
        [16., 18., 20., 22., 24., 26., 28., 30.]])
dropout_p = 1: tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])


Defining the Model 


The model below applies dropout to the output of each hidden layer (following the activation 
function). We can set dropout probabilities for each layer separately. A common choice is 
to set a lower dropout probability closer to the input layer. We ensure that dropout is only 
active during training.

class DropoutMLPScratch(d2l.Classifier):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr):
        super().__init__()
        self.save_hyperparameters()
        self.lin1 = nn.LazyLinear(num_hiddens_1)
        self.lin2 = nn.LazyLinear(num_hiddens_2)
        self.lin3 = nn.LazyLinear(num_outputs)
        self.relu = nn.ReLU()

    def forward(self, X):
        H1 = self.relu(self.lin1(X.reshape((X.shape[0], -1))))
        if self.training:
            H1 = dropout_layer(H1, self.dropout_1)
        H2 = self.relu(self.lin2(H1))
        if self.training:
            H2 = dropout_layer(H2, self.dropout_2)
        return self.lin3(H2)


Training 


The following is similar to the training of MLPs described previously. 


hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256,
           'dropout_1':0.5, 'dropout_2':0.5, 'lr':0.1}
model = DropoutMLPScratch(**hparams)
data = d2l.FashionMNIST(batch_size=256)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)


[Plot: train_loss, val_loss, and val_acc.]


5.6.3 Concise Implementation 


With high-level APIs, all we need to do is add a Dropout layer after each fully connected 
layer, passing in the dropout probability as the only argument to its constructor. During 
training, the Dropout layer will randomly drop out outputs of the previous layer (or equiv- 
alently, the inputs to the subsequent layer) according to the specified dropout probability. 
When not in training mode, the Dropout layer simply passes the data through unchanged.

class DropoutMLP(d2l.Classifier):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(num_hiddens_1), nn.ReLU(),
            nn.Dropout(dropout_1), nn.LazyLinear(num_hiddens_2), nn.ReLU(),
            nn.Dropout(dropout_2), nn.LazyLinear(num_outputs))


Next, we train the model. 


model = DropoutMLP(**hparams)
trainer.fit(model, data) 


[Plot: train_loss, val_loss, and val_acc, with a horizontal axis running from 0 to 10.]


5.6.4 Summary 


Beyond controlling the number of dimensions and the size of the weight vector, dropout is 
yet another tool for avoiding overfitting. Often tools are used jointly. Note that dropout is 
used only during training: it replaces an activation $h$ with a random variable with expected
value $h$.


5.6.5 Exercises 


1. What happens if you change the dropout probabilities for the first and second layers? 
In particular, what happens if you switch the ones for both layers? Design an experi- 
ment to answer these questions, describe your results quantitatively, and summarize the 
qualitative takeaways. 


2. Increase the number of epochs and compare the results obtained when using dropout 
with those when not using it. 


3. What is the variance of the activations in each hidden layer when dropout is and is not 
applied? Draw a plot to show how this quantity evolves over time for both models. 


4. Why is dropout not typically used at test time? 


5. Using the model in this section as an example, compare the effects of using dropout and
weight decay. What happens when dropout and weight decay are used at the same time?
Are the results additive? Are there diminished returns (or worse)? Do they cancel each 
other out? 


6. What happens if we apply dropout to the individual weights of the weight matrix rather 
than the activations? 


7. Invent another technique for injecting random noise at each layer that is different from 
the standard dropout technique. Can you develop a method that outperforms dropout on 
the Fashion-MNIST dataset (for a fixed architecture)? 




5.7 Predicting House Prices on Kaggle 


Now that we have introduced some basic tools for building and training deep networks and 
regularizing them with techniques including weight decay and dropout, we are ready to 
put all this knowledge into practice by participating in a Kaggle competition. The house 
price prediction competition is a great place to start. The data are fairly generic and do not
exhibit exotic structure that might require specialized models (as audio or video might). 
This dataset, collected by De Cock (2011), covers house prices in Ames, Iowa from the 
period 2006-2010. It is considerably larger than the famous Boston housing dataset of
Harrison and Rubinfeld (1978), boasting both more examples and more features. 


In this section, we will walk you through details of data preprocessing, model design, and 
hyperparameter selection. We hope that through a hands-on approach, you will gain some 
intuitions that will guide you in your career as a data scientist. 


%matplotlib inline
import pandas as pd
import torch
from torch import nn
from d2l import torch as d2l


5.7.1 Downloading Data 


Throughout the book, we will train and test models on various downloaded datasets. Here, 
we implement two utility functions for downloading and extracting zip or tar files. Again, 
we skip implementation details of such utility functions. 


def download(url, folder, sha1_hash=None):
    """Download a file to folder and return the local filepath."""

def extract(filename, folder):
    """Extract a zip/tar file into folder."""
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5.7.2 Kaggle 


Kaggle 109 is a popular platform that hosts machine learning competitions. Each competition
centers on a dataset and many are sponsored by stakeholders who offer prizes to
the winning solutions. The platform helps users to interact via forums and shared code, 
fostering both collaboration and competition. While leaderboard chasing often spirals out 
of control, with researchers focusing myopically on preprocessing steps rather than asking 
fundamental questions, there is also tremendous value in the objectivity of a platform that 
facilitates direct quantitative comparisons among competing approaches as well as code 
sharing so that everyone can learn what did and did not work. If you want to participate in 
a Kaggle competition, you will first need to register for an account (see Fig. 5.7.1). 


Fig. 5.7.1 The Kaggle website.


On the house price prediction competition page, as illustrated in Fig. 5.7.2, you can find 
the dataset (under the “Data” tab), submit predictions, and see your ranking. The URL is
right here: 


https://www.kaggle.com/c/house-prices-advanced-regression-techniques 


Fig. 5.7.2 The house price prediction competition page.


5.7.3 Accessing and Reading the Dataset


Note that the competition data is separated into training and test sets. Each record includes
the property value of the house and attributes such as street type, year of construction, roof 
type, basement condition, etc. The features consist of various data types. For example, 
the year of construction is represented by an integer, the roof type by discrete categorical 
assignments, and other features by floating point numbers. And here is where reality com- 
plicates things: for some examples, some data is altogether missing with the missing value 
marked simply as “na”. The price of each house is included for the training set only (it is 
a competition after all). We will want to partition the training set to create a validation set, 
but we only get to evaluate our models on the official test set after uploading predictions to 
Kaggle. The “Data” tab on the competition page in Fig. 5.7.2 has links for downloading the
data. 


To get started, we will read in and process the data using pandas, which we introduced 
in Section 2.2. For convenience, we can download and cache the Kaggle housing dataset. 
If a file corresponding to this dataset already exists in the cache directory and its SHA-1 
matches sha1_hash, our code will use the cached file to avoid clogging up your Internet
with redundant downloads. 
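
To make the caching rule concrete, here is a minimal sketch of the check, written with the Python standard library only. The helper names sha1_of_file and is_cached are hypothetical and are not part of the d2l package, whose actual download code may differ in its details.

```python
import hashlib
import os
import tempfile

def sha1_of_file(path, chunk_size=1048576):
    """Return the SHA-1 hex digest of a file, reading it in 1 MiB chunks."""
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sha1.update(chunk)
    return sha1.hexdigest()

def is_cached(path, sha1_hash=None):
    """Hypothetical helper: True if a usable cached copy already exists."""
    if not os.path.exists(path):
        return False
    # Without a reference hash we can only check that the file exists
    return sha1_hash is None or sha1_of_file(path) == sha1_hash

# Try it on a throwaway file (contents are made up)
with tempfile.TemporaryDirectory() as folder:
    fname = os.path.join(folder, 'demo.csv')
    with open(fname, 'wb') as f:
        f.write(b'Id,SalePrice\n1,208500\n')
    digest = sha1_of_file(fname)
    hit = is_cached(fname, digest)       # hash matches: reuse the file
    miss = is_cached(fname, '0' * 40)    # hash differs: download again

print(hit, miss)
```

Reading the file in chunks keeps memory use flat even for large archives, which is why the sketch does not call f.read() on the whole file at once.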


class KaggleHouse(d2l.DataModule):
    def __init__(self, batch_size, train=None, val=None):
        super().__init__()
        self.save_hyperparameters()
        if self.train is None:
            self.raw_train = pd.read_csv(d2l.download(
                d2l.DATA_URL + 'kaggle_house_pred_train.csv', self.root,
                sha1_hash='585e9cc93e70b39160e7921475f9bcd7d31219ce'))
            self.raw_val = pd.read_csv(d2l.download(
                d2l.DATA_URL + 'kaggle_house_pred_test.csv', self.root,
                sha1_hash='fa19780a7b011d9b009e8bff8e99922a8ee2eb90'))


The training dataset includes 1460 examples, 80 features, and one label, while the validation 
data contains 1459 examples and 80 features. 


data = KaggleHouse(batch_size=64)
print(data.raw_train.shape)
print(data.raw_val.shape)


Downloading ../data/kaggle_house_pred_train.csv from http://d2l-data.s3-accelerate.amazonaws.com/kaggle_house_pred_train.csv...
Downloading ../data/kaggle_house_pred_test.csv from http://d2l-data.s3-accelerate.amazonaws.com/kaggle_house_pred_test.csv...
(1460, 81)
(1459, 80)


5.7.4 Data Preprocessing 


Let’s take a look at the first four and final two features as well as the label (SalePrice) from 
the first four examples.


print(data.raw_train.iloc[:4, [0, 1, 2, 3, -3, -2, -1]])


   Id  MSSubClass MSZoning  LotFrontage SaleType SaleCondition  SalePrice
0   1          60       RL         65.0       WD        Normal     208500
1   2          20       RL         80.0       WD        Normal     181500
2   3          60       RL         68.0       WD        Normal     223500
3   4          70       RL         60.0       WD       Abnorml     140000


We can see that in each example, the first feature is the identifier. This helps the model
identify each training example. While this is convenient, it does not carry any information
for prediction purposes. Hence, we will remove it from the dataset before feeding the data 
into the model. Furthermore, given a wide variety of data types, we will need to preprocess 
the data before we can start modeling. 


Let’s start with the numerical features. First, we apply a heuristic, replacing all missing
values by the corresponding feature’s mean. Then, to put all features on a common scale,
we standardize the data by rescaling features to zero mean and unit variance:

\[x \leftarrow \frac{x - \mu}{\sigma}, \qquad (5.7.1)\]

where \(\mu\) and \(\sigma\) denote mean and standard deviation, respectively. To verify that this indeed
transforms our feature (variable) such that it has zero mean and unit variance, note that
\(E[\frac{x-\mu}{\sigma}] = \frac{\mu - \mu}{\sigma} = 0\) and that \(E[(x-\mu)^2] = (\sigma^2 + \mu^2) - 2\mu^2 + \mu^2 = \sigma^2\). Intuitively, we
standardize the data for two reasons. First, it proves convenient for optimization. Second, 
because we do not know a priori which features will be relevant, we do not want to penalize 
coefficients assigned to one feature more than any other. 
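
As a quick numerical check, the following sketch standardizes a toy column with pandas. The values are made up for illustration. It also shows why filling the remaining missing entries with 0 after standardization corresponds to replacing them by the feature's mean: once the column is centered, its mean is 0. Note that pandas computes the sample standard deviation and skips missing values by default.

```python
import pandas as pd

# Toy feature column with one missing entry (values are made up)
x = pd.Series([65.0, 80.0, None, 60.0, 95.0])

# pandas skips NaN when computing the mean and the sample standard deviation
z = (x - x.mean()) / x.std()
# After standardization the mean is 0, so filling with 0 is mean imputation
z_filled = z.fillna(0)

print(z_filled.round(3).tolist())
print(abs(z.mean()) < 1e-9, abs(z.std() - 1) < 1e-9)
```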


Next we deal with discrete values. These include features such as “MSZoning”. We replace 
them by a one-hot encoding in the same way that we earlier transformed multiclass labels 
into vectors (see Section 4.1.1). For instance, “MSZoning” assumes the values “RL” and 
“RM”. Dropping the “MSZoning” feature, two new indicator features “MSZoning_RL”
and “MSZoning_RM” are created with values being either 0 or 1. According to one-hot
encoding, if the original value of “MSZoning” is “RL”, then “MSZoning_RL” is 1 and
“MSZoning_RM” is 0. The pandas package does this automatically for us.
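
The encoding can be tried out on a toy column before applying it to the full dataset. This is a small sketch with made-up rows. Passing dtype=int is our choice here so that the indicators print as 0 and 1 rather than booleans.

```python
import pandas as pd

# Toy column with the two values discussed above plus a missing entry
toy = pd.DataFrame({'MSZoning': ['RL', 'RM', None, 'RL']})

# dummy_na=True adds an extra indicator column for missing values;
# dtype=int yields 0/1 integers instead of booleans
encoded = pd.get_dummies(toy, dummy_na=True, dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, including the row whose value is missing, which activates the extra column created by dummy_na=True.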


@d2l.add_to_class(KaggleHouse)
def preprocess(self):
    # Remove the ID and label columns
    label = 'SalePrice'
    features = pd.concat(
        (self.raw_train.drop(columns=['Id', label]),
         self.raw_val.drop(columns=['Id'])))
    # Standardize numerical columns
    numeric_features = features.dtypes[features.dtypes!='object'].index
    features[numeric_features] = features[numeric_features].apply(
        lambda x: (x - x.mean()) / (x.std()))
    # Replace NAN numerical features by 0
    features[numeric_features] = features[numeric_features].fillna(0)
    # Replace discrete features by one-hot encoding
    features = pd.get_dummies(features, dummy_na=True)
    # Save preprocessed features
    self.train = features[:self.raw_train.shape[0]].copy()
    self.train[label] = self.raw_train[label]
    self.val = features[self.raw_train.shape[0]:].copy()


You can see that this conversion increases the number of features from 79 to 331 (excluding 
ID and label columns). 


data.preprocess()
data.train.shape


(1460, 331) 


5.7.5 Error Measure 


To get started we will train a linear model with squared loss. Not surprisingly, our linear 
model will not lead to a competition-winning submission but it does provide a sanity check 
to see whether there is meaningful information in the data. If we cannot do better than 
random guessing here, then there might be a good chance that we have a data processing 
bug. And if things work, the linear model will serve as a baseline giving us some intuition 
about how close the simple model gets to the best reported models, giving us a sense of 
how much gain we should expect from fancier models. 


With house prices, as with stock prices, we care about relative quantities more than absolute
quantities. Thus we tend to care more about the relative error \(\frac{y - \hat{y}}{y}\) than about the
absolute error \(y - \hat{y}\). For instance, if our prediction is off by $100,000 when estimating the
price of a house in rural Ohio, where the value of a typical house is $125,000, then we are 
probably doing a horrible job. On the other hand, if we err by this amount in Los Altos 
Hills, California, this might represent a stunningly accurate prediction (there, the median 
house price exceeds $4 million). 


One way to address this problem is to measure the discrepancy in the logarithm of the price 
estimates. In fact, this is also the official error measure used by the competition to evaluate 
the quality of submissions. After all, a small value \(\delta\) for \(|\log y - \log \hat{y}| \leq \delta\) translates into
\(e^{-\delta} \leq \frac{\hat{y}}{y} \leq e^{\delta}\). This leads to the following root-mean-squared-error between the logarithm
of the predicted price and the logarithm of the label price:

\[\sqrt{\frac{1}{n}\sum_{i=1}^n \left(\log y_i - \log \hat{y}_i\right)^2}. \qquad (5.7.2)\]
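
To see how this measure treats the two situations described above, here is a small sketch in plain Python. The function name log_rmse is our own, and the prices are the illustrative figures from the text, with an overestimate of $100,000 in each market.

```python
import math

def log_rmse(labels, preds):
    """Root-mean-squared error between log labels and log predictions."""
    n = len(labels)
    return math.sqrt(sum((math.log(y) - math.log(p)) ** 2
                         for y, p in zip(labels, preds)) / n)

# The same absolute error of $100,000 in two very different markets
cheap = log_rmse([125000.0], [225000.0])
expensive = log_rmse([4000000.0], [4100000.0])
print(f'cheap market: {cheap:.3f}, expensive market: {expensive:.3f}')
```

The error for the inexpensive house is roughly 0.59, while the error for the expensive house is roughly 0.02, matching the intuition that the first prediction is far worse in relative terms.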


@d2l.add_to_class(KaggleHouse)
def get_dataloader(self, train):
    label = 'SalePrice'
    data = self.train if train else self.val
    if label not in data: return
    get_tensor = lambda x: torch.tensor(x.values.astype(float),
                                      dtype=torch.float32)
    # Logarithm of prices
    tensors = (get_tensor(data.drop(columns=[label])),  # X
               torch.log(get_tensor(data[label])).reshape((-1, 1)))  # Y
    return self.get_tensorloader(tensors, train)


5.7.6 K-Fold Cross-Validation 


You might recall that we introduced cross-validation in Section 3.6.3, where we discussed 
how to deal with model selection. We will put this to good use to select the model design 
and to adjust the hyperparameters. We first need a function that returns the \(i^\textrm{th}\) fold of the
data in a K-fold cross-validation procedure. It proceeds by slicing out the \(i^\textrm{th}\) segment as
validation data and returning the rest as training data. Note that this is not the most efficient 
way of handling data and we would definitely do something much smarter if our dataset was 
considerably larger. But this added complexity might obfuscate our code unnecessarily so 
we can safely omit it here owing to the simplicity of our problem. 


def k_fold_data(data, k):
    rets = []
    fold_size = data.train.shape[0] // k
    for j in range(k):
        idx = range(j * fold_size, (j+1) * fold_size)
        rets.append(KaggleHouse(data.batch_size, data.train.drop(index=idx),
                                data.train.loc[idx]))
    return rets
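
The index arithmetic can be inspected in isolation. The sketch below uses a hypothetical helper, fold_ranges, that mirrors the slicing above on plain integer indices, assuming the rows are labeled 0 to n-1. Because the fold size is computed by integer division, any leftover examples are never held out for validation and remain in every training split.

```python
def fold_ranges(n, k):
    """Hypothetical helper: (validation, training) index lists per fold."""
    fold_size = n // k
    folds = []
    for j in range(k):
        val_idx = list(range(j * fold_size, (j + 1) * fold_size))
        held_out = set(val_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        folds.append((val_idx, train_idx))
    return folds

# 11 examples, 3 folds: each validation fold has 3 examples, and the
# last two examples (indices 9 and 10) are only ever used for training
folds = fold_ranges(11, 3)
for val_idx, train_idx in folds:
    print(val_idx, len(train_idx))
```

With the 1460 training examples and k=5 used in this section, the split is exact: each fold holds out 292 examples.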


The average validation error is printed when we train K times in the K-fold cross-validation, and the K trained models are returned.


def k_fold(trainer, data, k, lr):
    val_loss, models = [], []
    for i, data_fold in enumerate(k_fold_data(data, k)):
        model = d2l.LinearRegression(lr)
        model.board.yscale='log'
        if i != 0: model.board.display = False
        trainer.fit(model, data_fold)
        val_loss.append(float(model.board.data['val_loss'][-1].y))
        models.append(model)
    print(f'average validation log mse = {sum(val_loss)/len(val_loss)}')
    return models


5.7.7 Model Selection 


In this example, we pick an untuned set of hyperparameters and leave it up to the reader to 
improve the model. Finding a good choice can take time, depending on how many variables 
one optimizes over. With a large enough dataset, and the normal sorts of hyperparameters,
K-fold cross-validation tends to be reasonably resilient against multiple testing. However,
if we try an unreasonably large number of options we might find that our validation perfor- 
mance is no longer representative of the true error. 


trainer = d2l.Trainer(max_epochs=10)
models = k_fold(trainer, data, k=5, lr=0.01)


average validation log mse = 0.17325432986021042


[Figure: train_loss and val_loss curves.]


Notice that sometimes the training error for a set of hyperparameters can be very
low, even as the error on K-fold cross-validation grows considerably higher.
This indicates that we are overfitting. Throughout training you will want to monitor both 
numbers. Less overfitting might indicate that our data can support a more powerful model. 
Massive overfitting might suggest that we can gain by incorporating regularization tech- 
niques. 


5.7.8 Submitting Predictions on Kaggle 


Now that we know what a good choice of hyperparameters should be, we might calculate 
the average predictions on the test set by all the K models. Saving the predictions in a csv 
file will simplify uploading the results to Kaggle. The following code will generate a file 
called submission.csv. 


preds = [model(torch.tensor(data.val.values.astype(float), dtype=torch.float32))
         for model in models]
# Taking exponentiation of predictions in the logarithm scale
ensemble_preds = torch.exp(torch.cat(preds, 1)).mean(1)
submission = pd.DataFrame({'Id':data.raw_val.Id,
                           'SalePrice':ensemble_preds.detach().numpy()})
submission.to_csv('submission.csv', index=False)


Next, as demonstrated in Fig. 5.7.3, we can submit our predictions on Kaggle and see how 
they compare with the actual house prices (labels) on the test set. The steps are quite 
simple: 


• Log in to the Kaggle website and visit the house price prediction competition page.

• Click the “Submit Predictions” or “Late Submission” button.

• Click the “Upload Submission File” button in the dashed box at the bottom of the page
and select the prediction file you wish to upload.

• Click the “Make Submission” button at the bottom of the page to view your results.


Fig. 5.7.3 Submitting data to Kaggle.


5.7.9 Summary and Discussion 


Real data often contains a mix of different data types and needs to be preprocessed. Rescal- 
ing real-valued data to zero mean and unit variance is a good default. So is replacing miss- 
ing values with their mean. Furthermore, transforming categorical features into indicator 
features allows us to treat them like one-hot vectors. When we tend to care more about the 
relative error than about the absolute error, we can measure the discrepancy in the loga- 
rithm of the prediction. To select the model and adjust the hyperparameters, we can use 
K-fold cross-validation.


5.7.10 Exercises


1. Submit your predictions for this section to Kaggle. How good are they?


2. Is it always a good idea to replace missing values by a mean? Hint: can you construct a
situation where the values are not missing at random?


3. Improve the score by tuning the hyperparameters through K-fold cross-validation.


4. Improve the score by improving the model (e.g., layers, weight decay, and dropout).


5. What happens if we do not standardize the continuous numerical features as we have
done in this section?

Discussions 110.


Alongside giant datasets and powerful hardware, great software tools have played an indispensable
role in the rapid progress of deep learning. Starting with the pathbreaking
Theano library released in 2007, flexible open-source tools have enabled researchers to 
rapidly prototype models, avoiding repetitive work when recycling standard components 
while still maintaining the ability to make low-level modifications. Over time, deep learn- 
ing’s libraries have evolved to offer increasingly coarse abstractions. Just as semiconductor 
designers went from specifying transistors to logical circuits to writing code, neural net- 
works researchers have moved from thinking about the behavior of individual artificial neu- 
rons to conceiving of networks in terms of whole layers, and now often design architectures 
with far coarser blocks in mind. 


So far, we have introduced some basic machine learning concepts, ramping up to fully- 
functional deep learning models. In the last chapter, we implemented each component of 
an MLP from scratch and even showed how to leverage high-level APIs to roll out the 
same models effortlessly. To get you that far that fast, we called upon the libraries, but 
skipped over more advanced details about how they work. In this chapter, we will peel 
back the curtain, digging deeper into the key components of deep learning computation, 
namely model construction, parameter access and initialization, designing custom layers 
and blocks, reading and writing models to disk, and leveraging GPUs to achieve dramatic 
speedups. These insights will move you from end user to power user, giving you the tools 
needed to reap the benefits of a mature deep learning library while retaining the flexibility to 
implement more complex models, including those you invent yourself! While this chapter 
does not introduce any new models or datasets, the advanced modeling chapters that follow 
rely heavily on these techniques. 


6.1 Layers and Modules 


When we first introduced neural networks, we focused on linear models with a single out- 
put. Here, the entire model consists of just a single neuron. Note that a single neuron (i) 
takes some set of inputs; (ii) generates a corresponding scalar output; and (iii) has a set of 
associated parameters that can be updated to optimize some objective function of interest. 
Then, once we started thinking about networks with multiple outputs, we leveraged vectorized
arithmetic to characterize an entire layer of neurons. Just like individual neurons,
layers (i) take a set of inputs, (ii) generate corresponding outputs, and (iii) are described by
a set of tunable parameters. When we worked through softmax regression, a single layer 
was itself the model. However, even when we subsequently introduced MLPs, we could 
still think of the model as retaining this same basic structure. 


Interestingly, for MLPs, both the entire model and its constituent layers share this structure. 
The entire model takes in raw inputs (the features), generates outputs (the predictions), and 
possesses parameters (the combined parameters from all constituent layers). Likewise, 
each individual layer ingests inputs (supplied by the previous layer), generates outputs (the
inputs to the subsequent layer), and possesses a set of tunable parameters that are updated 
according to the signal that flows backwards from the subsequent layer. 


While you might think that neurons, layers, and models give us enough abstractions to go 
about our business, it turns out that we often find it convenient to speak about components 
that are larger than an individual layer but smaller than the entire model. For example, the 
ResNet-152 architecture, which is wildly popular in computer vision, possesses hundreds of 
layers. These layers consist of repeating patterns of groups of layers. Implementing such 
a network one layer at a time can grow tedious. This concern is not just hypothetical— 
such design patterns are common in practice. The ResNet architecture mentioned above 
won the 2015 ImageNet and COCO computer vision competitions for both recognition and 
detection (He et al., 2016) and remains a go-to architecture for many vision tasks. Similar 
architectures in which layers are arranged in various repeating patterns are now ubiquitous 
in other domains, including natural language processing and speech. 


To implement these complex networks, we introduce the concept of a neural network mod- 
ule. A module could describe a single layer, a component consisting of multiple layers, 
or the entire model itself! One benefit of working with the module abstraction is that modules
can be combined into larger artifacts, often recursively. This is illustrated in Fig. 6.1.1. 
By defining code to generate modules of arbitrary complexity on demand, we can write 
surprisingly compact code and still implement complex neural networks. 


Fig. 6.1.1 Multiple layers are combined into modules, forming repeating patterns of larger models.


From a programming standpoint, a module is represented by a class. Any subclass of it 
must define a forward propagation method that transforms its input into output and must 
store any necessary parameters. Note that some modules do not require any parameters at
all. Finally, a module must possess a backpropagation method, for purposes of calculating
gradients. Fortunately, due to some behind-the-scenes magic supplied by automatic
differentiation (introduced in Section 2.5), when defining our own module we only need to worry
about parameters and the forward propagation method.
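
As a small illustration of this division of labor, the toy module below defines only a parameter and a forward method, yet gradients are available after calling backward on its output. The class name Scale and its parameter alpha are made up for this sketch.

```python
import torch
from torch import nn

class Scale(nn.Module):
    """Toy module: multiplies its input by a learnable scalar."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(2.0))

    def forward(self, X):
        return self.alpha * X

net = Scale()
X = torch.tensor([1.0, 2.0, 3.0])
net(X).sum().backward()   # no hand-written backward method is needed
print(net.alpha.grad)     # derivative of sum(alpha * X) w.r.t. alpha is sum(X)
```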


import torch 
from torch import nn 
from torch.nn import functional as F 


To begin, we revisit the code that we used to implement MLPs (Section 5.1). The follow- 
ing code generates a network with one fully connected hidden layer with 256 units and 
ReLU activation, followed by a fully connected output layer with ten units (no activation 
function). 


net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

X = torch.rand(2, 20)
net(X).shape


torch.Size([2, 10]) 


In this example, we constructed our model by instantiating an nn.Sequential, with layers
in the order that they should be executed passed as arguments. In short, nn.Sequential
defines a special kind of Module, the class that represents a module in PyTorch. It maintains
an ordered list of constituent Modules. Note that each of the two fully connected layers is an
instance of the Linear class, which is itself a subclass of Module. The forward propagation
(forward) method is also remarkably simple: it chains each module in the list together,
passing the output of each as input to the next. Note that until now, we have been invoking
our models via the construction net(X) to obtain their outputs. This is actually just
shorthand for net.__call__(X).
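
Both statements can be checked on a small network. The sketch below uses nn.Linear with explicit input sizes instead of the lazy layers, purely to keep the example short. It compares the usual call, the explicit __call__, and a hand-written chain over the constituent modules.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Explicit input sizes (an assumption of this sketch) instead of lazy layers
net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
X = torch.rand(2, 20)

out_call = net(X)               # the usual shorthand
out_dunder = net.__call__(X)    # what the shorthand expands to

# Chaining the constituent modules by hand gives the same result
out_manual = X
for layer in net:
    out_manual = layer(out_manual)

print(torch.allclose(out_call, out_dunder), torch.allclose(out_call, out_manual))
```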


6.1.1 A Custom Module 


Perhaps the easiest way to develop intuition about how a module works is to implement 
one ourselves. Before we do that, we briefly summarize the basic functionality that each 
module must provide: 


1. Ingest input data as arguments to its forward propagation method. 


2. Generate an output by having the forward propagation method return a value. Note 
that the output may have a different shape from the input. For example, the first fully 
connected layer in our model above ingests an input of arbitrary dimension but returns 
an output of dimension 256. 


3. Calculate the gradient of its output with respect to its input, which can be accessed via
its backpropagation method. Typically this happens automatically.


4. Store and provide access to those parameters necessary for executing the forward propagation
computation.


5. Initialize model parameters as needed. 


In the following snippet, we code up a module from scratch corresponding to an MLP 
with one hidden layer with 256 hidden units, and a 10-dimensional output layer. Note that 
the MLP class below inherits the class that represents a module. We will heavily rely on 
the parent class’s methods, supplying only our own constructor (the __init__ method in 
Python) and the forward propagation method. 


class MLP(nn.Module):
    def __init__(self):
        # Call the constructor of the parent class nn.Module to perform
        # the necessary initialization
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.out = nn.LazyLinear(10)

    # Define the forward propagation of the model, that is, how to return the
    # required model output based on the input X
    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))


Let’s first focus on the forward propagation method. Note that it takes X as input, calcu- 
lates the hidden representation with the activation function applied, and outputs its logits. 
In this MLP implementation, both layers are instance variables. To see why this is reason- 
able, imagine instantiating two MLPs, net1 and net2, and training them on different data. 
Naturally, we would expect them to represent two different learned models. 


We instantiate the MLP’s layers in the constructor and subsequently invoke these layers on 
each call to the forward propagation method. Note a few key details. First, our customized 
__init__ method invokes the parent class’s __init__ method via super().__init__(),
sparing us the pain of restating boilerplate code applicable to most modules. We then
instantiate our two fully connected layers, assigning them to self.hidden and self.out.
Note that unless we implement a new layer, we need not worry about the backpropagation 
method or parameter initialization. The system will generate these methods automatically. 
Let’s try this out. 


net = MLP()
net(X).shape


torch.Size([2, 10]) 


A key virtue of the module abstraction is its versatility. We can subclass a module to create 
layers (such as the fully connected layer class), entire models (such as the MLP class above), 
or various components of intermediate complexity. We exploit this versatility throughout 
the coming chapters, such as when addressing convolutional neural networks.


6.1.2 The Sequential Module


We can now take a closer look at how the Sequential class works. Recall that Sequen- 
tial was designed to daisy-chain other modules together. To build our own simplified 
MySequential, we just need to define two key methods: 


1. A method for appending modules one by one to a list. 


2. A forward propagation method for passing an input through the chain of modules, in the 
same order as they were appended. 


The following MySequential class delivers the same functionality as the default Sequential
class.


class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            self.add_module(str(idx), module)

    def forward(self, X):
        for module in self.children():
            X = module(X)
        return X


In the __init__ method, we add every module by calling the add_module method. These
modules can be accessed later via the children method. In this way the system
knows the added modules, and it will properly initialize each module’s parameters.
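
To see what this registration buys us, the following sketch repeats the MySequential class so that it runs on its own and then lists the registered children and their parameters. As an assumption of the sketch, it uses nn.Linear with explicit input sizes rather than the lazy layers.

```python
import torch
from torch import nn

class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            self.add_module(str(idx), module)

    def forward(self, X):
        for module in self.children():
            X = module(X)
        return X

net = MySequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
# Each module is registered under the string name passed to add_module
child_names = [name for name, _ in net.named_children()]
# Parameters of registered children are visible through the parent module
param_names = [name for name, _ in net.named_parameters()]
print(child_names)
print(param_names)
```

The parameter names are prefixed with the name of the child that owns them, and the ReLU layer contributes none.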


When our MySequential’s forward propagation method is invoked, each added module is 
executed in the order in which they were added. We can now reimplement an MLP using 
our MySequential class. 


net = MySequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
net(X).shape


torch.Size([2, 10]) 


Note that this use of MySequential is identical to the code we previously wrote for the 
Sequential class (as described in Section 5.1). 


6.1.3 Executing Code in the Forward Propagation Method 


The Sequential class makes model construction easy, allowing us to assemble new archi- 
tectures without having to define our own class. However, not all architectures are simple 
daisy chains. When greater flexibility is required, we will want to define our own blocks. 
For example, we might want to execute Python’s control flow within the forward propaga- 
tion method. Moreover, we might want to perform arbitrary mathematical operations, not 
simply relying on predefined neural network layers.


You may have noticed that until now, all of the operations in our networks have acted upon
our network’s activations and its parameters. Sometimes, however, we might want to in- 
corporate terms that are neither the result of previous layers nor updatable parameters. We 
call these constant parameters. Say for example that we want a layer that calculates the
function \(f(\mathbf{x}, \mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}\), where \(\mathbf{x}\) is the input, \(\mathbf{w}\) is our parameter, and \(c\) is some specified
constant that is not updated during optimization. So we implement a FixedHiddenMLP
class as follows.


class FixedHiddenMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Random weight parameters that will not compute gradients and
        # therefore keep constant during training
        self.rand_weight = torch.rand((20, 20))
        self.linear = nn.LazyLinear(20)

    def forward(self, X):
        X = self.linear(X)
        X = F.relu(X @ self.rand_weight + 1)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters with two fully connected layers
        X = self.linear(X)
        # Control flow
        while X.abs().sum() > 1:
            X /= 2
        return X.sum()


In this model, we implement a hidden layer whose weights (self.rand_weight) are initialized
randomly at instantiation and are thereafter constant. This weight is not a model
parameter and thus it is never updated by backpropagation. The network then passes the 
output of this “fixed” layer through a fully connected layer. 
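
The claim that the fixed weight is not a model parameter can be checked directly. The sketch below uses a cut-down variant of the class, named FixedHidden here, with nn.Linear in place of the lazy layer and without the while-loop. These simplifications are ours. After one backward pass and one SGD step, the fixed weight has no gradient and is unchanged.

```python
import torch
from torch import nn
from torch.nn import functional as F

class FixedHidden(nn.Module):
    """Cut-down variant of FixedHiddenMLP, for illustration only."""
    def __init__(self):
        super().__init__()
        self.rand_weight = torch.rand((20, 20))  # plain tensor, not a parameter
        self.linear = nn.Linear(20, 20)

    def forward(self, X):
        X = self.linear(X)
        X = F.relu(X @ self.rand_weight + 1)
        return self.linear(X).sum()

net = FixedHidden()
before = net.rand_weight.clone()
net(torch.rand(2, 20)).backward()
torch.optim.SGD(net.parameters(), lr=0.1).step()

param_names = [name for name, _ in net.named_parameters()]
print(param_names)                            # only the linear layer
print(net.rand_weight.grad)                   # no gradient was computed
print(torch.equal(before, net.rand_weight))   # unchanged by the update
```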


Note that before returning the output, our model did something unusual. We ran a while-loop,
testing whether the \(\ell_1\) norm of the output is larger than 1, and dividing our output vector
by 2 until the norm was no larger than 1. Finally, we returned the sum of the entries in X. To our
knowledge, no standard neural network performs this operation. Note that this particular 
operation may not be useful in any real-world task. Our point is only to show you how to 
integrate arbitrary code into the flow of your neural network computations. 


net = FixedHiddenMLP() 
net(X) 


tensor(-0.3836, grad_fn=<SumBackward0>)


We can mix and match various ways of assembling modules together. In the following 
example, we nest modules in some creative ways. 


class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.LazyLinear(64), nn.ReLU(),
                                 nn.LazyLinear(32), nn.ReLU())
        self.linear = nn.LazyLinear(16)

    def forward(self, X):
        return self.linear(self.net(X))


chimera = nn.Sequential(NestMLP(), nn.LazyLinear(20), FixedHiddenMLP()) 
chimera(X) 


tensor(0.0679, grad_fn=<SumBackward0>)


6.1.4 Summary 


Individual layers can be modules. Many layers can comprise a module. Many modules can 
comprise a module. 


A module can contain code. Modules take care of lots of housekeeping, including param- 
eter initialization and backpropagation. Sequential concatenations of layers and modules 
are handled by the Sequential module. 


6.1.5 Exercises 


1. What kinds of problems will occur if you change MySequential to store modules in a 
Python list? 


2. Implement a module that takes two modules as arguments, say net1 and net2, and
returns the concatenated output of both networks in the forward propagation. This is 
also called a parallel module. 


3. Assume that you want to concatenate multiple instances of the same network. Imple- 
ment a factory function that generates multiple instances of the same module and build 
a larger network from it. 


Discussions 111.


6.2 Parameter Management 


Once we have chosen an architecture and set our hyperparameters, we proceed to the train- 
ing loop, where our goal is to find parameter values that minimize our loss function. After 
training, we will need these parameters in order to make future predictions. Additionally, 
we will sometimes wish to extract the parameters, perhaps to reuse them in some other 
context, to save our model to disk so that it may be executed in other software, or for 
examination in the hope of gaining scientific understanding. 


Most of the time, we will be able to ignore the nitty-gritty details of how parameters are 
declared and manipulated, relying on deep learning frameworks to do the heavy lifting. 
However, when we move away from stacked architectures with standard layers, we will 
sometimes need to get into the weeds of declaring and manipulating parameters. In this 
section, we cover the following: 


• Accessing parameters for debugging, diagnostics, and visualizations. 

• Sharing parameters across different model components. 


import torch 
from torch import nn 


We start by focusing on an MLP with one hidden layer. 


net = nn.Sequential(nn.LazyLinear(8),
                    nn.ReLU(),
                    nn.LazyLinear(1))

X = torch.rand(size=(2, 4))
net(X).shape


torch.Size([2, 1]) 


6.2.1 Parameter Access 


Let’s start with how to access parameters from the models that you already know. 

When a model is defined via the Sequential class, we can first access any layer by indexing 
into the model as though it were a list. Each layer’s parameters are conveniently located in 
its attribute. 


We can inspect the parameters of the second fully connected layer as follows. 


net[2].state_dict()


OrderedDict([('weight',
              tensor([[-0.1649,  0.0605,  0.1694, -0.2524,  0.3526, -0.3414, -0.2322,  0.0822]])),
             ('bias', tensor([0.0709]))])


We can see that this fully connected layer contains two parameters, corresponding to that 
layer’s weights and biases, respectively. 


Targeted Parameters 


Note that each parameter is represented as an instance of the parameter class. To do any- 
thing useful with the parameters, we first need to access the underlying numerical values. 
There are several ways to do this. Some are simpler while others are more general. The 
following code extracts the bias from the second neural network layer, which returns a 
parameter class instance, and further accesses that parameter’s value. 


type(net[2].bias), net[2].bias.data 


(torch.nn.parameter.Parameter, tensor([0.0709]))


Parameters are complex objects, containing values, gradients, and additional information. 
That is why we need to request the value explicitly. 


In addition to the value, each parameter also allows us to access the gradient. Because we 
have not invoked backpropagation for this network yet, it is in its initial state. 


net[2].weight.grad == None 


True 
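
To see the gradient leave its initial state, we can run a single backward pass. The following is a minimal sketch rather than part of the original example: it assumes explicit nn.Linear input sizes (instead of the lazy layers used above) so that the snippet is self-contained, and the names grad_before and grad_after are introduced here purely for illustration.

```python
import torch
from torch import nn

# Same architecture as above, but with explicit input sizes so that the
# parameters exist immediately (an assumption made for this sketch).
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
X = torch.rand(size=(2, 4))

grad_before = net[2].weight.grad   # None: backpropagation has not run yet
net(X).sum().backward()            # populates .grad for every parameter
grad_after = net[2].weight.grad    # has the same shape as the weight itself

print(grad_before is None, grad_after.shape)
```

After the backward pass, the gradient is an ordinary tensor whose shape matches that of the parameter it belongs to.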


All Parameters at Once 


When we need to perform operations on all parameters, accessing them one-by-one can 
grow tedious. The situation can grow especially unwieldy when we work with more com- 
plex, e.g., nested, modules, since we would need to recurse through the entire tree to extract 
each sub-module’s parameters. Below we demonstrate accessing the parameters of all lay- 
ers. 


[(name, param.shape) for name, param in net.named_parameters()]


[('0.weight', torch.Size([8, 4])),
 ('0.bias', torch.Size([8])),
 ('2.weight', torch.Size([1, 8])),
 ('2.bias', torch.Size([1]))]
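
Iterating over all parameters is also handy for quick diagnostics, such as counting how many scalar values a model holds. The sketch below is illustrative only: it assumes explicit nn.Linear sizes chosen to match the shapes listed above, and the names sizes and total are our own.

```python
import torch
from torch import nn

# Explicit sizes chosen to match the shapes printed above (an assumption).
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# numel() returns the number of scalar entries in a parameter tensor.
sizes = {name: param.numel() for name, param in net.named_parameters()}
total = sum(sizes.values())

print(sizes)
print(total)
```

For this small MLP the count is 8 × 4 + 8 + 1 × 8 + 1 = 49 values.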


6.2.2 Tied Parameters 


Often, we want to share parameters across multiple layers. Let’s see how to do this elegantly. 
In the following we allocate a fully connected layer and then use its parameters specifically 
to set those of another layer. Here we need to run the forward propagation net(X) before 
accessing the parameters. 




# We need to give the shared layer a name so that we can refer to its
# parameters
shared = nn.LazyLinear(8)
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.LazyLinear(1))

net(X)
# Check whether the parameters are the same
print(net[2].weight.data[0] == net[4].weight.data[0])
net[2].weight.data[0, 0] = 100
# Make sure that they are actually the same object rather than just having the
# same value
print(net[2].weight.data[0] == net[4].weight.data[0])


tensor([True, True, True, True, True, True, True, True]) 
tensor([True, True, True, True, True, True, True, True]) 


This example shows that the parameters of the second and third layer are tied. They are 
not just equal, they are represented by the same exact tensor. Thus, if we change one of the 
parameters, the other one changes, too. 


You might wonder, when parameters are tied what happens to the gradients? Since the 
model parameters contain gradients, the gradients of the second hidden layer and the third 
hidden layer are added together during backpropagation. 
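
The claim that the gradients are added together can be checked numerically. The following sketch is our own illustration under stated assumptions: it uses small bias-free nn.Linear layers with explicit sizes, and it compares a tied network against an untied copy (the names first, second, and untied are hypothetical) whose two layers start from the same values.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.rand(1, 3)

# Tied network: the same layer object is used twice.
shared = nn.Linear(3, 3, bias=False)
tied = nn.Sequential(shared, nn.ReLU(), shared)
tied(x).sum().backward()

# Untied copy: two separate layers holding identical values.
first = nn.Linear(3, 3, bias=False)
second = nn.Linear(3, 3, bias=False)
with torch.no_grad():
    first.weight.copy_(shared.weight)
    second.weight.copy_(shared.weight)
untied = nn.Sequential(first, nn.ReLU(), second)
untied(x).sum().backward()

# The tied gradient should equal the sum of the two untied gradients.
grads_add_up = torch.allclose(shared.weight.grad,
                              first.weight.grad + second.weight.grad,
                              atol=1e-6)
num_tied_params = len(list(tied.parameters()))
print(grads_add_up, num_tied_params)
```

Note that parameters() reports the shared weight only once, so an optimizer would update it a single time per step using the accumulated gradient.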


6.2.3 Summary 


We have several ways of accessing and tying model parameters. 


6.2.4 Exercises 


1. Use the NestMLP model defined in Section 6.1 and access the parameters of the various 
layers. 


2. Construct an MLP containing a shared parameter layer and train it. During the training 
process, observe the model parameters and gradients of each layer. 


3. Why is sharing parameters a good idea? 




6.3 Parameter Initialization 


Now that we know how to access the parameters, let’s look at how to initialize them properly. 
We discussed the need for proper initialization in Section 5.4. The deep learning 
framework provides default random initializations to its layers. However, we often want to 
initialize our weights according to various other protocols. The framework provides most 
commonly used protocols, and also allows us to create a custom initializer. 


import torch 
from torch import nn 


By default, PyTorch initializes weight and bias matrices uniformly by drawing from a range 
that is computed according to the input and output dimension. PyTorch’s nn.init module 
provides a variety of preset initialization methods. 


net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1)) 
X = torch.rand(size=(2, 4)) 
net(X).shape


torch.Size([2, 1]) 


6.3.1 Built-in Initialization 


Let’s begin by calling on built-in initializers. The code below initializes all weight parame- 
ters as Gaussian random variables with standard deviation 0.01, while bias parameters are 
cleared to zero. 


def init_normal(module):
    if type(module) == nn.Linear:
        nn.init.normal_(module.weight, mean=0, std=0.01)
        nn.init.zeros_(module.bias)

net.apply(init_normal)
net[0].weight.data[0], net[0].bias.data[0]


(tensor([-0.0129, -0.0007, -0.0033,  0.0276]), tensor(0.))


We can also initialize all the parameters to a given constant value (say, 1). 


def init_constant(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 1)
        nn.init.zeros_(module.bias)

net.apply(init_constant)
net[0].weight.data[0], net[0].bias.data[0]


(tensor([1., 1., 1., 1.]), tensor(0.))


We can also apply different initializers for certain blocks. For example, below we initialize 
the first layer with the Xavier initializer and initialize the second layer to a constant value 
of 42. 


def init_xavier(module):
    if type(module) == nn.Linear:
        nn.init.xavier_uniform_(module.weight)

def init_42(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 42)

net[0].apply(init_xavier)
net[2].apply(init_42)
print(net[0].weight.data[0])
print(net[2].weight.data)


tensor([-0.0974,  0.1707,  0.5840, -0.5032])
tensor([[42., 42., 42., 42., 42., 42., 42., 42.]])
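
As a sanity check on the Xavier initializer, recall that nn.init.xavier_uniform_ with its default gain of 1 draws from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)). The sketch below assumes a standalone nn.Linear(4, 8) layer, matching the first layer's shape above; the names layer and bound are introduced only for this illustration.

```python
import math

import torch
from torch import nn

layer = nn.Linear(4, 8)                  # fan_in = 4, fan_out = 8
nn.init.xavier_uniform_(layer.weight)    # default gain is 1

# Every sampled weight must lie within [-bound, bound].
bound = math.sqrt(6 / (4 + 8))
largest = layer.weight.abs().max().item()

print(bound, largest)
```

Here the bound is sqrt(0.5), roughly 0.707, which is consistent with the magnitudes printed for the first layer above.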


Custom Initialization 


Sometimes, the initialization methods we need are not provided by the deep learning frame- 
work. In the example below, we define an initializer for any weight parameter w using the 
following strange distribution: 


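$$
w \sim \begin{cases}
    U(5, 10) & \text{with probability } \frac{1}{4} \\
    0 & \text{with probability } \frac{1}{2} \\
    U(-10, -5) & \text{with probability } \frac{1}{4}
\end{cases}
\tag{6.3.1}
$$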


Again, we implement a my_init function to apply to net. 


def my_init(module):
    if type(module) == nn.Linear:
        print("Init", *[(name, param.shape)
                        for name, param in module.named_parameters()][0])
        nn.init.uniform_(module.weight, -10, 10)
        module.weight.data *= module.weight.data.abs() >= 5

net.apply(my_init)
net[0].weight[:2]


Init weight torch.Size([8, 4])
Init weight torch.Size([1, 8])

tensor([[ 0.0000, -7.6364, -0.0000, -6.1206],
        [ 9.3516, -0.0000,  5.1208, -8.4003]], grad_fn=<SliceBackward0>)
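
It may not be obvious why sampling from U(-10, 10) and zeroing the small entries reproduces the distribution above. A value drawn uniformly from [-10, 10] has magnitude below 5 with probability 1/2, and the surviving values are split evenly between [5, 10] and [-10, -5]. The following sketch checks this empirically on a large standalone tensor; the sample size and the frac_* names are our own choices.

```python
import torch
from torch import nn

torch.manual_seed(0)
w = torch.empty(100_000)
nn.init.uniform_(w, -10, 10)
w *= w.abs() >= 5          # same masking trick as in my_init

# Empirical probabilities of the three cases of the distribution.
frac_zero = (w == 0).float().mean().item()
frac_pos = (w >= 5).float().mean().item()
frac_neg = (w <= -5).float().mean().item()

print(frac_zero, frac_pos, frac_neg)
```

The three fractions should come out close to 1/2, 1/4, and 1/4, respectively.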


Note that we always have the option of setting parameters directly. 


net[0].weight.data[:] += 1
net[0].weight.data[0, 0] = 42
net[0].weight.data[0]


tensor([42.0000, -6.6364,  1.0000, -5.1206])


6.3.2 Summary 


We can initialize parameters using built-in and custom initializers. 


6.3.3 Exercises 


Look up the online documentation for more built-in initializers. 




6.4 Lazy Initialization 


So far, it might seem that we got away with being sloppy in setting up our networks. Specif- 
ically, we did the following unintuitive things, which might not seem like they should 
work: 


• We defined the network architectures without specifying the input dimensionality. 

• We added layers without specifying the output dimension of the previous layer. 

• We even “initialized” these parameters before providing enough information to determine 
how many parameters our models should contain. 


You might be surprised that our code runs at all. After all, there is no way the deep learning 
framework could tell what the input dimensionality of a network would be. The trick here 
is that the framework defers initialization, waiting until the first time we pass data through 
the model, to infer the sizes of each layer on the fly. 


Later on, when working with convolutional neural networks, this technique will become 
even more convenient since the input dimensionality (e.g., the resolution of an image) will 
affect the dimensionality of each subsequent layer. Hence the ability to set parameters 
without the need to know, at the time of writing the code, the value of the dimension can 
greatly simplify the task of specifying and subsequently modifying our models. Next, we 
go deeper into the mechanics of initialization. 


import torch 
from torch import nn 
from d2l import torch as d2l


To begin, let’s instantiate an MLP. 


net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))


At this point, the network cannot possibly know the dimensions of the input layer’s weights 
because the input dimension remains unknown. 


Consequently the framework has not yet initialized any parameters. We confirm by at- 
tempting to access the parameters below. 


net[0].weight


<UninitializedParameter> 


Next let’s pass data through the network to make the framework finally initialize parame- 
ters. 


X = torch.rand(2, 20)
net(X)


net[0].weight.shape


torch.Size([256, 20]) 


As soon as we know the input dimensionality, 20, the framework can identify the shape of 
the first layer’s weight matrix by plugging in the value of 20. Having recognized the first 
layer’s shape, the framework proceeds to the second layer, and so on through the computa- 
tional graph until all shapes are known. Note that in this case, only the first layer requires 
lazy initialization, but the framework initializes sequentially. Once all parameter shapes 
are known, the framework can finally initialize the parameters. 


The following method passes in dummy inputs through the network for a dry run to infer 
all parameter shapes and subsequently initializes the parameters. It will be used later when 
default random initializations are not desired. 


@d2l.add_to_class(d2l.Module)  #@save
def apply_init(self, inputs, init=None):
    self.forward(*inputs)
    if init is not None:
        self.net.apply(init)
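
The same dry-run idea can be tried without the d2l helper classes. The sketch below is a standalone approximation, not the library method: apply_init here is a plain function taking the network as its first argument, and init_normal mirrors the initializer from Section 6.3. It also checks, via the UninitializedParameter class in torch.nn.parameter, that the weights are placeholders before the first forward pass.

```python
import torch
from torch import nn
from torch.nn.parameter import UninitializedParameter

def apply_init(net, inputs, init=None):
    """Dry-run the inputs to infer all shapes, then optionally re-initialize."""
    net(*inputs)
    if init is not None:
        net.apply(init)

def init_normal(module):
    # After the first forward pass, lazy linear layers behave as nn.Linear.
    if type(module) == nn.Linear:
        nn.init.normal_(module.weight, mean=0, std=0.01)
        nn.init.zeros_(module.bias)

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
was_lazy = isinstance(net[0].weight, UninitializedParameter)

apply_init(net, [torch.rand(2, 20)], init_normal)
still_lazy = isinstance(net[0].weight, UninitializedParameter)

print(was_lazy, still_lazy, net[0].weight.shape, net[2].weight.shape)
```

Calling the custom initializer before the dry run would have no effect on the lazy layers, since their weights do not have a shape yet; this is why the forward pass comes first.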


6.4.1 Summary 


Lazy initialization can be convenient, allowing the framework to infer parameter shapes 
automatically, making it easy to modify architectures and eliminating one common source 
of errors. We can pass data through the model to make the framework finally initialize 
parameters. 


6.4.2 Exercises 


1. What happens if you specify the input dimensions to the first layer but not to subsequent 
layers? Do you get immediate initialization? 


2. What happens if you specify mismatching dimensions? 


3. What would you need to do if you have input of varying dimensionality? Hint: look at 
the parameter tying. 




6.5 Custom Layers 


One factor behind deep learning’s success is the availability of a wide range of layers that 
can be composed in creative ways to design architectures suitable for a wide variety of 
tasks. For instance, researchers have invented layers specifically for handling images, text, 
looping over sequential data, and performing dynamic programming. Sooner or later, you 
will need a layer that does not exist yet in the deep learning framework. In these cases, you 
must build a custom layer. In this section, we show you how. 


import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


6.5.1 Layers without Parameters 


To start, we construct a custom layer that does not have any parameters of its own. This 
should look familiar if you recall our introduction to modules in Section 6.1. The following 
CenteredLayer class simply subtracts the mean from its input. To build it, we simply need 
to inherit from the base layer class and implement the forward propagation function. 


class CenteredLayer(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, X):
        return X - X.mean()


Let’s verify that our layer works as intended by feeding some data through it. 


layer = CenteredLayer()
layer(torch.tensor([1.0, 2, 3, 4, 5]))


tensor([-2., -1.,  0.,  1.,  2.])


We can now incorporate our layer as a component in constructing more complex mod- 
els. 


net = nn.Sequential(nn.LazyLinear(128), CenteredLayer()) 


As an extra sanity check, we can send random data through the network and check that the 
mean is in fact 0. Because we are dealing with floating point numbers, we may still see a 
very small nonzero number due to quantization. 


Y = net(torch.rand(4, 8)) 
Y.mean() 


tensor(-6.5193e-09, grad_fn=<MeanBackward0>)


6.5.2 Layers with Parameters 


Now that we know how to define simple layers, let’s move on to defining layers with pa- 
rameters that can be adjusted through training. We can use built-in functions to create 
parameters, which provide some basic housekeeping functionality. In particular, they gov- 
ern access, initialization, sharing, saving, and loading model parameters. This way, among 
other benefits, we will not need to write custom serialization routines for every custom 
layer. 


Now let’s implement our own version of the fully connected layer. Recall that this layer 
requires two parameters, one to represent the weight and the other for the bias. In this im- 
plementation, we bake in the ReLU activation as a default. This layer requires two input 
arguments: in_units and units, which denote the number of inputs and outputs, respec- 
tively. 


class MyLinear(nn.Module):
    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.randn(units,))

    def forward(self, X):
        linear = torch.matmul(X, self.weight.data) + self.bias.data
        return F.relu(linear)


Next, we instantiate the MyLinear class and access its model parameters. 


linear = MyLinear(5, 3)
linear.weight


Parameter containing:
tensor([[ 0.4783,  0.4284, -0.0899],
        [-0.6347,  0.2913, -0.0822],
        [-0.4325, -0.1645, -0.3274],
        [ 1.1898,  0.6482, -1.2384],
        [-0.1479,  0.0264, -0.9597]], requires_grad=True)


We can directly carry out forward propagation calculations using custom layers. 


linear(torch.rand(2, 5)) 


tensor([[0.0000, 0.9316, 0.0000],
        [0.1808, 1.4208, 0.0000]])
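
One detail is worth noting. The forward pass of MyLinear reads self.weight.data and self.bias.data, and as far as we can tell this is why the printed output carries no grad_fn: values accessed through .data are not tracked by autograd. If the goal is to adjust the parameters through training, one option is to use the parameters directly. The sketch below shows such a variant; the class name MyTrainableLinear is hypothetical and not part of the original text.

```python
import torch
from torch import nn
from torch.nn import functional as F

class MyTrainableLinear(nn.Module):  # hypothetical name for this sketch
    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.randn(units))

    def forward(self, X):
        # Using the parameters themselves keeps them in the autograd graph.
        return F.relu(torch.matmul(X, self.weight) + self.bias)

layer = MyTrainableLinear(5, 3)
out = layer(torch.rand(2, 5))
out.sum().backward()

print(out.grad_fn is not None, layer.weight.grad.shape)
```

With this variant, the backward pass fills in weight.grad and bias.grad, so an optimizer can update the layer like any built-in one.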


We can also construct models using custom layers. Once we have that, we can use it just 
like the built-in fully connected layer. 


net = nn.Sequential(MyLinear(64, 8), MyLinear(8, 1)) 
net(torch.rand(2, 64)) 


tensor([[ 0.2000],
        [13.0800]])


6.5.3 Summary 


We can design custom layers via the basic layer class. This allows us to define flexible 
new layers that behave differently from any existing layers in the library. Once defined, 
custom layers can be invoked in arbitrary contexts and architectures. Layers can have local 
parameters, which can be created through built-in functions. 


6.5.4 Exercises 


1. Design a layer that takes an input and computes a tensor reduction, i.e., it returns 
$y_k = \sum_{i, j} W_{ijk} x_i x_j$. 
2. Design a layer that returns the leading half of the Fourier coefficients of the data. 




6.6 File I/O 


So far we have discussed how to process data and how to build, train, and test deep learning 
models. However, at some point we will hopefully be happy enough with the learned 
models that we will want to save the results for later use in various contexts (perhaps even 
to make predictions in deployment). Additionally, when running a long training process, 
the best practice is to periodically save intermediate results (checkpointing) to ensure that 
we do not lose several days’ worth of computation if we trip over the power cord of our 
server. Thus it is time to learn how to load and store both individual weight vectors and 
entire models. This section addresses both issues. 


import torch 
from torch import nn 
from torch.nn import functional as F 


6.6.1 Loading and Saving Tensors 


For individual tensors, we can directly invoke the load and save functions to read and 
write them respectively. Both functions require that we supply a name, and save requires 
as input the variable to be saved. 


x = torch.arange(4)
torch.save(x, 'x-file')


We can now read the data from the stored file back into memory. 


x2 = torch.load('x-file')
x2


tensor([0, 1, 2, 3])


We can store a list of tensors and read them back into memory. 


y = torch.zeros(4)
torch.save([x, y], 'x-files')
x2, y2 = torch.load('x-files')
(x2, y2)


(tensor([0, 1, 2, 3]), tensor([0., 0., 0., 0.]))


We can even write and read a dictionary that maps from strings to tensors. This is conve- 
nient when we want to read or write all the weights in a model. 


mydict = {'x': x, 'y': y}
torch.save(mydict, 'mydict')
mydict2 = torch.load('mydict')
mydict2


{'x': tensor([0, 1, 2, 3]), 'y': tensor([0., 0., 0., 0.])}


6.6.2 Loading and Saving Model Parameters 


Saving individual weight vectors (or other tensors) is useful, but it gets very tedious if we 
want to save (and later load) an entire model. After all, we might have hundreds of param- 
eter groups sprinkled throughout. For this reason the deep learning framework provides 
built-in functionalities to load and save entire networks. An important detail to note is that 
this saves model parameters and not the entire model. For example, if we have a 3-layer 
MLP, we need to specify the architecture separately. The reason for this is that the models 
themselves can contain arbitrary code, hence they cannot be serialized as naturally. Thus, 
in order to reinstate a model, we need to generate the architecture in code and then load the 
parameters from disk. Let’s start with our familiar MLP. 


class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.output = nn.LazyLinear(10)

    def forward(self, x):
        return self.output(F.relu(self.hidden(x)))

net = MLP()
X = torch.randn(size=(2, 20))
Y = net(X)


Next, we store the parameters of the model as a file with the name “mlp.params”. 


torch.save(net.state_dict(), 'mlp.params') 


To recover the model, we instantiate a clone of the original MLP model. Instead of ran- 
domly initializing the model parameters, we read the parameters stored in the file directly. 


clone = MLP()
clone.load_state_dict(torch.load('mlp.params'))
clone.eval()


MLP(
  (hidden): LazyLinear(in_features=0, out_features=256, bias=True)
  (output): LazyLinear(in_features=0, out_features=10, bias=True)
)


Since both instances have the same model parameters, the computational result of the same 
input X should be the same. Let's verify this. 


Y_clone = clone(X)
Y_clone == Y


tensor([[True, True, True, True, True, True, True, True, True, True], 
[True, True, True, True, True, True, True, True, True, True]]) 
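
The same mechanism extends naturally to the periodic checkpointing mentioned at the start of this section. The following is a minimal sketch under our own assumptions: the dictionary keys, the file name checkpoint.pt, and the use of a temporary directory are illustrative choices, and the model uses explicit nn.Linear sizes so that the snippet is self-contained.

```python
import os
import tempfile

import torch
from torch import nn

net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'checkpoint.pt')  # hypothetical file name
    # Bundle everything needed to resume training into one dictionary.
    torch.save({'epoch': 3,
                'model': net.state_dict(),
                'optimizer': optimizer.state_dict()}, path)
    checkpoint = torch.load(path)

# The architecture is rebuilt in code; only the parameters come from disk.
clone = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
clone.load_state_dict(checkpoint['model'])

X = torch.randn(size=(2, 20))
same = torch.equal(net(X), clone(X))
print(checkpoint['epoch'], same)
```

Storing the optimizer state and an epoch counter alongside the parameters is a common convention that makes it possible to continue an interrupted run rather than merely reuse the weights.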


6.6.3 Summary 


The save and load functions can be used to perform file I/O for tensor objects. We can 
save and load the entire sets of parameters for a network via a parameter dictionary. Saving 
the architecture has to be done in code rather than in parameters. 


6.6.4 Exercises 


1. Even if there is no need to deploy trained models to a different device, what are the 
practical benefits of storing model parameters? 


2. Assume that we want to reuse only parts of a network to be incorporated into a network 
having a different architecture. How would you go about using, say, the first two layers 
from a previous network in a new network? 


3. How would you go about saving the network architecture and parameters? What restric- 
tions would you impose on the architecture? 




6.7 GPUs 


In tab_intro_decade, we illustrated the rapid growth of computation over the past two 
decades. In a nutshell, GPU performance has increased by a factor of 1000 every decade 
since 2000. This offers great opportunities but it also suggests that there was significant 
demand for such performance. 


In this section, we begin to discuss how to harness this computational performance for your 
research. We start with a single GPU and, at a later point, discuss how to use multiple GPUs 
and multiple servers (with multiple GPUs). 


Specifically, we will discuss how to use a single NVIDIA GPU for calculations. First, 
make sure you have at least one NVIDIA GPU installed. Then, download the NVIDIA 
driver and CUDA and follow the prompts to set the appropriate path. Once these preparations 
are complete, the nvidia-smi command can be used to view the graphics card 
information. 


In PyTorch, every array has a device; we often refer to it as a context. So far, by default, all 
variables and associated computation have been assigned to the CPU. Typically, other contexts 
might be various GPUs. Things can get even hairier when we deploy jobs across multiple 
servers. By assigning arrays to contexts intelligently, we can minimize the time spent 
transferring data between devices. For example, when training neural networks on a server 
with a GPU, we typically prefer for the model’s parameters to live on the GPU. 


To run the programs in this section, you need at least two GPUs. Note that this might 
be extravagant for most desktop computers but it is easily available in the cloud, e.g., by 
using the AWS EC2 multi-GPU instances. Almost all other sections do not require multiple 
GPUs, but here we simply wish to illustrate data flow between different devices. 


import torch 
from torch import nn 
from d2l import torch as d2l


6.7.1 Computing Devices 


We can specify devices, such as CPUs and GPUs, for storage and calculation. By default, 
tensors are created in the main memory and then the CPU is used for calculations. 


In PyTorch, the CPU and GPU can be indicated by torch.device('cpu') and 
torch.device('cuda'). It should be noted that the cpu device means all physical CPUs and 
memory. This means that PyTorch’s calculations will try to use all CPU cores. However, a 
gpu device only represents one card and the corresponding memory. If there are multiple 
GPUs, we use torch.device(f'cuda:{i}') to represent the i-th GPU (i starts at 0). Also, 
gpu:0 and gpu are equivalent. 


def cpu():  #@save
    """Get the CPU device."""
    return torch.device('cpu')

def gpu(i=0):  #@save
    """Get a GPU device."""
    return torch.device(f'cuda:{i}')

cpu(), gpu(), gpu(1)


(device(type='cpu'),
 device(type='cuda', index=0),
 device(type='cuda', index=1))


We can query the number of available GPUs. 


def num_gpus():  #@save
    """Get the number of available GPUs."""
    return torch.cuda.device_count()

num_gpus()


Now we define two convenient functions that allow us to run code even if the requested 
GPUs do not exist. 


def try_gpu(i=0):  #@save
    """Return gpu(i) if exists, otherwise return cpu()."""
    if num_gpus() >= i + 1:
        return gpu(i)
    return cpu()

def try_all_gpus():  #@save
    """Return all available GPUs, or [cpu(),] if no GPU exists."""
    devices = [gpu(i) for i in range(num_gpus())]
    # Fall back to the CPU so that the result matches the docstring
    return devices if devices else [cpu()]

try_gpu(), try_gpu(10), try_all_gpus()


(device(type='cuda', index=0),
 device(type='cpu'),
 [device(type='cuda', index=0), device(type='cuda', index=1)])


6.7.2 Tensors and GPUs 


By default, tensors are created on the CPU. We can query the device where the tensor is 
located. 


x = torch.tensor([1, 2, 3]) 
x.device


device(type='cpu') 


It is important to note that whenever we want to operate on multiple terms, they need to be 
on the same device. For instance, if we sum two tensors, we need to make sure that both 
arguments live on the same device—otherwise the framework would not know where to 
store the result or even how to decide where to perform the computation. 


Storage on the GPU 


There are several ways to store a tensor on the GPU. For example, we can specify a storage 
device when creating a tensor. Next, we create the tensor variable X on the first GPU. 
The tensor created on a GPU only consumes the memory of this GPU. We can use the 
nvidia-smi command to view GPU memory usage. In general, we need to make sure that 
we do not create data that exceeds the GPU memory limit. 


X = torch.ones(2, 3, device=try_gpu())
X


tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0')


Assuming that you have at least two GPUs, the following code will create a random tensor, 
Y, on the second GPU. 


Y = torch.rand(2, 3, device=try_gpu(1)) 
Y 


tensor([[0.0022, 0.5723, 0.2890],
        [0.1456, 0.3537, 0.7359]], device='cuda:1')


Copying 


If we want to compute X + Y, we need to decide where to perform this operation. For 
instance, as shown in Fig. 6.7.1, we can transfer X to the second GPU and perform the 
operation there. Do not simply add X and Y, since this will result in an exception. The 
runtime engine would not know what to do: it cannot find data on the same device and it 
fails. Since Y lives on the second GPU, we need to move X there before we can add the 
two. 


Fig. 6.7.1: Copy data to perform an operation on the same device. 


Z = X.cuda(1)
print(X)
print(Z)


tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0')
tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:1')


Now that the data (both Z and Y) are on the same GPU, we can add them up. 


Y + Z


tensor([[1.0022, 1.5723, 1.2890],
        [1.1456, 1.3537, 1.7359]], device='cuda:1')


But what if your variable Z already lived on your second GPU? What happens if we still call 
Z.cuda(1)? It will return Z instead of making a copy and allocating new memory. 


Z.cuda(1) is Z 


True 
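
The same no-copy behavior can be observed with the more general to method, which also makes it easy to write code that runs with or without a GPU. The sketch below is our own illustration: it falls back to the CPU when no GPU is available, and the variable names are arbitrary.

```python
import torch

# Pick the first GPU if one exists, otherwise fall back to the CPU.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

x = torch.ones(2, 3, device=device)
y = x.to(device)   # x already lives on `device`, so no copy is made

print(y is x, x.device)
```

Because to returns the tensor itself when it is already on the requested device, it is safe to call it defensively before an operation without paying for a redundant copy.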


Side Notes 


People use GPUs to do machine learning because they expect them to be fast. But trans- 
ferring variables between devices is slow: much slower than computation. So we want you 
to be 100% certain that you want to do something slow before we let you do it. If the deep 
learning framework just did the copy automatically without crashing then you might not 
realize that you had written some slow code. 


Transferring data is not only slow, it also makes parallelization a lot more difficult, since 
we have to wait for data to be sent (or rather to be received) before we can proceed with 
more operations. This is why copy operations should be taken with great care. As a rule of 
thumb, many small operations are much worse than one big operation. Moreover, several 
operations at a time are much better than many single operations interspersed in the code 
unless you know what you are doing. This is the case since such operations can block if 
one device has to wait for the other before it can do something else. It is a bit like ordering 
your coffee in a queue rather than pre-ordering it by phone and finding out that it is ready 
when you are. 


Last, when we print tensors or convert tensors to the NumPy format, if the data is not in the 
main memory, the framework will copy it to the main memory first, resulting in additional 
transmission overhead. Even worse, it is now subject to the dreaded global interpreter lock 
that makes everything wait for Python to complete. 
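
To make the difference between many small transfers and one larger transfer concrete, the following sketch contrasts two logging patterns. It is illustrative only: the matrix sizes, the number of steps, and all variable names are our own assumptions, and the code falls back to the CPU when no GPU is present, in which case the two patterns differ in style but not in cost.

```python
import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(0)
batches = [torch.rand(16, 16) for _ in range(10)]

# Pattern 1: one device-to-host transfer per step via .item().
log_per_step = []
for b in batches:
    b = b.to(device)
    log_per_step.append((b @ b).norm().item())

# Pattern 2: keep the log on the device and transfer it once at the end.
log_on_device = torch.zeros(len(batches), device=device)
for i, b in enumerate(batches):
    b = b.to(device)
    log_on_device[i] = (b @ b).norm()
log_once = log_on_device.cpu().tolist()

print(len(log_per_step), len(log_once))
```

Both patterns record the same numbers; the second simply defers the synchronization between device and host to a single point.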


6.7.3 Neural Networks and GPUs 


Similarly, a neural network model can specify devices. The following code puts the model 
parameters on the GPU. 


net = nn.Sequential(nn.LazyLinear(1)) 
net = net.to(device=try_gpu()) 


We will see many more examples of how to run models on GPUs in the following chapters, 
simply because the models will become somewhat more computationally intensive. 


For example, when the input is a tensor on the GPU, the model will calculate the result on 
the same GPU. 


net(X)


tensor([[0.7802],
        [0.7802]], device='cuda:0', grad_fn=<AddmmBackward0>)


Let’s confirm that the model parameters are stored on the same GPU. 


net[0].weight.data.device


device(type='cuda', index=0)


Let the trainer support GPU. 


@d2l.add_to_class(d2l.Trainer)  #@save
def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
    self.save_hyperparameters()
    self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))]

@d2l.add_to_class(d2l.Trainer)  #@save
def prepare_batch(self, batch):
    if self.gpus:
        batch = [a.to(self.gpus[0]) for a in batch]
    return batch

@d2l.add_to_class(d2l.Trainer)  #@save
def prepare_model(self, model):
    model.trainer = self
    model.board.xlim = [0, self.max_epochs]
    if self.gpus:
        model.to(self.gpus[0])
    self.model = model


In short, as long as all data and parameters are on the same device, we can learn models 
efficiently. In the following chapters we will see several such examples. 


6.7.4 Summary 


We can specify devices for storage and calculation, such as the CPU or GPU. By default, 
data is created in the main memory and the CPU is used for calculations. The deep 
learning framework requires all input data for calculation to be on the same device, be it 
the CPU or the same GPU. You can lose significant performance by moving data without care. 
A typical mistake is as follows: computing the loss for every minibatch on the GPU and 
reporting it back to the user on the command line (or logging it in a NumPy ndarray) will 
trigger a global interpreter lock which stalls all GPUs. It is much better to allocate memory 
for logging inside the GPU and only move larger logs. 


6.7.5 Exercises 


1. Try a larger computation task, such as the multiplication of large matrices, and see the
difference in speed between the CPU and GPU. What about a task with a small number
of calculations?


2. How should we read and write model parameters on the GPU?


3. Measure the time it takes to compute 1000 matrix-matrix multiplications of $100 \times 100$
matrices and log the Frobenius norm of the output matrix one result at a time. Compare
it with keeping a log on the GPU and transferring only the final result.


4. Measure how much time it takes to perform two matrix-matrix multiplications on two
GPUs at the same time. Compare it with computing in sequence on one GPU. Hint:
you should see almost linear scaling.




Convolutional Neural Networks 


Image data is represented as a two-dimensional grid of pixels, be the image monochro- 
matic or in color. Accordingly each pixel corresponds to one or multiple numerical values 
respectively. So far we have ignored this rich structure and treated images as vectors of 
numbers by flattening them, irrespective of the spatial relation between pixels. This deeply 
unsatisfying approach was necessary in order to feed the resulting one-dimensional vectors 
through a fully connected MLP. 


Because these networks are invariant to the order of the features, we could get similar 
results regardless of whether we preserve an order corresponding to the spatial structure 
of the pixels or if we permute the columns of our design matrix before fitting the MLP’s 
parameters. Ideally, we would leverage our prior knowledge that nearby pixels are typically 
related to each other, to build efficient models for learning from image data. 


This chapter introduces convolutional neural networks (CNNs) (LeCun et al., 1995), a
powerful family of neural networks that are designed for precisely this purpose. CNN-based
architectures are now ubiquitous in the field of computer vision. For instance, on the
ImageNet collection (Deng et al., 2009) it was only the use of convolutional neural networks,
ConvNets for short, that provided significant performance improvements (Krizhevsky et al.,
2012).


Modern CNNs, as they are called colloquially, owe their design to inspirations from biol- 
ogy, group theory, and a healthy dose of experimental tinkering. In addition to their sample 
efficiency in achieving accurate models, CNNs tend to be computationally efficient, both 
because they require fewer parameters than fully connected architectures and because con- 
volutions are easy to parallelize across GPU cores (Chetlur et al., 2014). Consequently, 
practitioners often apply CNNs whenever possible, and increasingly they have emerged 
as credible competitors even on tasks with a one-dimensional sequence structure, such as 
audio (Abdel-Hamid et al., 2014), text (Kalchbrenner et al., 2014), and time series analy- 
sis (LeCun et al., 1995), where recurrent neural networks are conventionally used. Some 
clever adaptations of CNNs have also brought them to bear on graph-structured data (Kipf 
and Welling, 2016) and in recommender systems. 


First, we will dive more deeply into the motivation for convolutional neural networks. This
is followed by a walk through the basic operations that comprise the backbone of all
convolutional networks. These include the convolutional layers themselves, nitty-gritty details
including padding and stride, the pooling layers used to aggregate information across
adjacent spatial regions, the use of multiple channels at each layer, and a careful discussion
of the structure of modern architectures. We will conclude the chapter with a full working
example of LeNet, the first convolutional network successfully deployed, long before the
rise of modern deep learning. In the next chapter, we will dive into full implementations of
some popular and comparatively recent CNN architectures whose designs represent most
of the techniques commonly used by modern practitioners.


7.1 From Fully Connected Layers to Convolutions 


To this day, the models that we have discussed so far remain appropriate options when we 
are dealing with tabular data. By tabular, we mean that the data consist of rows corre- 
sponding to examples and columns corresponding to features. With tabular data, we might 
anticipate that the patterns we seek could involve interactions among the features, but we 
do not assume any structure a priori concerning how the features interact. 


Sometimes, we truly lack the knowledge to be able to guide the construction of fancier 
architectures. In these cases, an MLP may be the best that we can do. However, for high- 
dimensional perceptual data, such structureless networks can grow unwieldy. 


For instance, let’s return to our running example of distinguishing cats from dogs. Say that 
we do a thorough job in data collection, collecting an annotated dataset of one-megapixel 
photographs. This means that each input to the network has one million dimensions. Even 
an aggressive reduction to one thousand hidden dimensions would require a fully connected 
layer characterized by $10^6 \times 10^3 = 10^9$ parameters. Unless we have lots of GPUs, a talent for
distributed optimization, and an extraordinary amount of patience, learning the parameters 
of this network may turn out to be infeasible. 
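The arithmetic behind this estimate is easy to reproduce. The sketch below counts only the weights of one fully connected layer (biases are ignored) and, as an assumption that is not part of the text, estimates the storage cost at 4 bytes per weight (float32). The helper name `dense_weight_count` is made up for this illustration.

```python
def dense_weight_count(num_inputs, num_hidden):
    """Number of weights in a fully connected layer (biases not counted)."""
    return num_inputs * num_hidden


one_megapixel = 1000 * 1000  # flattened one-megapixel grayscale image
hidden_units = 1000          # the "aggressive reduction" from the text

weights = dense_weight_count(one_megapixel, hidden_units)
# Assumption for illustration: each weight is stored as a 4-byte float32
gigabytes = weights * 4 / 1e9

print(f"{weights:.0e} weights, roughly {gigabytes:.0f} GB in float32")
```

Even before any gradients or optimizer state are stored, a single such layer would occupy several gigabytes under this assumption.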


A careful reader might object to this argument on the basis that one megapixel resolution 
may not be necessary. However, while we might be able to get away with one hundred 
thousand pixels, our hidden layer of size 1000 grossly underestimates the number of hid- 
den units that it takes to learn good representations of images, so a practical system will 
still require billions of parameters. Moreover, learning a classifier by fitting so many pa- 
rameters might require collecting an enormous dataset. And yet today both humans and 
computers are able to distinguish cats from dogs quite well, seemingly contradicting these 
intuitions. That is because images exhibit rich structure that can be exploited by humans 
and machine learning models alike. Convolutional neural networks (CNNs) are one cre- 
ative way that machine learning has embraced for exploiting some of the known structure 
in natural images. 


7.1.1 Invariance 


Imagine that we want to detect an object in an image. It seems reasonable that whatever
method we use to recognize objects should not be overly concerned with the precise location
of the object in the image. Ideally, our system should exploit this knowledge. Pigs usually
do not fly and planes usually do not swim. Nonetheless, we should still recognize a pig
were one to appear at the top of the image. We can draw some inspiration here from the
children’s game “Where’s Waldo” (which itself has inspired many real-life imitations, such
as that depicted in Fig. 7.1.1). The game consists of a number of chaotic scenes bursting
with activities. Waldo shows up somewhere in each, typically lurking in some unlikely
location. The reader’s goal is to locate him. Despite his characteristic outfit, this can be
surprisingly difficult, due to the large number of distractions. However, what Waldo looks
like does not depend upon where Waldo is located. We could sweep the image with a Waldo
detector that could assign a score to each patch, indicating the likelihood that the patch
contains Waldo. In fact, many object detection and segmentation algorithms are based
on this approach (Long et al., 2015). CNNs systematize this idea of spatial invariance,
exploiting it to learn useful representations with fewer parameters.


We can now make these intuitions more concrete by enumerating a few desiderata to guide 
our design of a neural network architecture suitable for computer vision: 


1. In the earliest layers, our network should respond similarly to the same patch, regardless 
of where it appears in the image. This principle is called translation invariance (or 
translation equivariance). 


2. The earliest layers of the network should focus on local regions, without regard for the 
contents of the image in distant regions. This is the locality principle. Eventually, these 
local representations can be aggregated to make predictions at the whole image level. 


3. As we proceed, deeper layers should be able to capture longer-range features of the 
image, in a way similar to higher level vision in nature. 


Let’s see how this translates into mathematics. 


7.1.2 Constraining the MLP 


To start off, we can consider an MLP with two-dimensional images X as inputs and their
immediate hidden representations H similarly represented as matrices (they are two-dimensional
tensors in code), where both X and H have the same shape. Let that sink in. We now
imagine that not only the inputs but also the hidden representations possess spatial
structure.


Let $[\mathbf{X}]_{i, j}$ and $[\mathbf{H}]_{i, j}$ denote the pixel at location $(i, j)$ in the input image and hidden
representation, respectively. Consequently, to have each of the hidden units receive input from
each of the input pixels, we would switch from using weight matrices (as we did previously
in MLPs) to representing our parameters as fourth-order weight tensors W. Supposing that
U contains biases, we could formally express the fully connected layer as


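$$\begin{aligned}
[\mathbf{H}]_{i, j} &= [\mathbf{U}]_{i, j} + \sum_k \sum_l [\mathsf{W}]_{i, j, k, l} [\mathbf{X}]_{k, l} \\
&= [\mathbf{U}]_{i, j} + \sum_a \sum_b [\mathsf{V}]_{i, j, a, b} [\mathbf{X}]_{i+a, j+b}.
\end{aligned}$$ (7.1.1)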


The switch from W to V is entirely cosmetic for now since there is a one-to-one
correspondence between coefficients in both fourth-order tensors. We simply re-index the subscripts
$(k, l)$ such that $k = i+a$ and $l = j+b$. In other words, we set $[\mathsf{V}]_{i, j, a, b} = [\mathsf{W}]_{i, j, i+a, j+b}$.
The indices $a$ and $b$ run over both positive and negative offsets, covering the entire image.
For any given location $(i, j)$ in the hidden representation $[\mathbf{H}]_{i, j}$, we compute its value by
summing over pixels in X, centered around $(i, j)$ and weighted by $[\mathsf{V}]_{i, j, a, b}$. Before we
carry on, let’s consider the total number of parameters required for a single layer in this
parametrization: a $1000 \times 1000$ image (1 megapixel) is mapped to a $1000 \times 1000$ hidden
representation. This requires $10^{12}$ parameters, far beyond what computers currently can
handle.
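To see that re-indexing the weights changes nothing about the computation, the following minimal sketch evaluates both forms of (7.1.1) on a tiny 3 x 3 "image" using plain Python lists. The random values, the image size `n`, and the helper names are assumptions chosen only for illustration.

```python
import random

n = 3  # tiny n x n image so that the fourth-order tensor stays small
random.seed(0)
X = [[random.random() for _ in range(n)] for _ in range(n)]
U = [[random.random() for _ in range(n)] for _ in range(n)]
# W[i][j][k][l]: weight connecting input pixel (k, l) to hidden unit (i, j)
W = [[[[random.random() for _ in range(n)] for _ in range(n)]
      for _ in range(n)] for _ in range(n)]


def hidden_with_W(i, j):
    """First form: sum over absolute input positions (k, l)."""
    return U[i][j] + sum(W[i][j][k][l] * X[k][l]
                         for k in range(n) for l in range(n))


def V(i, j, a, b):
    """Re-indexed weights: V[i, j, a, b] = W[i, j, i + a, j + b]."""
    return W[i][j][i + a][j + b]


def hidden_with_V(i, j):
    """Second form: sum over offsets (a, b) relative to (i, j)."""
    return U[i][j] + sum(V(i, j, a, b) * X[i + a][j + b]
                         for a in range(-i, n - i) for b in range(-j, n - j))


max_gap = max(abs(hidden_with_W(i, j) - hidden_with_V(i, j))
              for i in range(n) for j in range(n))
num_weights = n ** 4  # one weight per (hidden position, input position) pair
print(max_gap, num_weights)
```

The offsets run from `-i` to `n - i - 1` so that `i + a` always stays inside the image, which is why both positive and negative values of `a` and `b` are needed. The weight count grows as the fourth power of the side length, which is where the figure for a megapixel image comes from.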


Translation Invariance 


Now let’s invoke the first principle established above: translation invariance (Zhang et al.,
1988). This implies that a shift in the input X should simply lead to a shift in the hidden
representation H. This is only possible if V and U do not actually depend on $(i, j)$. As
such, we have $[\mathsf{V}]_{i, j, a, b} = [\mathbf{V}]_{a, b}$ and U is a constant, say $u$. As a result, we can simplify
the definition for H:


$$[\mathbf{H}]_{i, j} = u + \sum_a \sum_b [\mathbf{V}]_{a, b} [\mathbf{X}]_{i+a, j+b}.$$ (7.1.2)


This is a convolution! We are effectively weighting pixels at $(i+a, j+b)$ in the vicinity of
location $(i, j)$ with coefficients $[\mathbf{V}]_{a, b}$ to obtain the value $[\mathbf{H}]_{i, j}$. Note that $[\mathbf{V}]_{a, b}$ needs
many fewer coefficients than $[\mathsf{V}]_{i, j, a, b}$ since it no longer depends on the location within
the image. Consequently, the number of parameters required is no longer $10^{12}$ but a much
more reasonable $4 \times 10^6$: we still have the dependency on $a, b \in (-1000, 1000)$. In short,
we have made significant progress. Time-delay neural networks (TDNNs) are some of the
first examples to exploit this idea (Waibel et al., 1989).
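The claim that a shift in the input leads to the same shift in the hidden representation can be checked directly. The sketch below is a simplified illustration under several assumptions: it uses plain Python lists, only non-negative offsets, evaluates only positions where the kernel fits inside the image, and the names `shared_kernel_layer` and `shift_right` are made up for this example.

```python
def shared_kernel_layer(X, V, u=0.0):
    """Compute H[i][j] = u + sum_{a,b} V[a][b] * X[i+a][j+b] with a, b >= 0.

    Only positions where the kernel fits entirely inside X are evaluated.
    """
    kh, kw = len(V), len(V[0])
    out_h, out_w = len(X) - kh + 1, len(X[0]) - kw + 1
    return [[u + sum(V[a][b] * X[i + a][j + b]
                     for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]


def shift_right(X, s):
    """Shift every row s pixels to the right, filling with zeros on the left."""
    return [[0.0] * s + row[:len(row) - s] for row in X]


# A 4 x 6 image with a small pattern near the left edge
X = [[0, 0, 0, 0, 0, 0],
     [0, 1, 2, 0, 0, 0],
     [0, 3, 4, 0, 0, 0],
     [0, 0, 0, 0, 0, 0]]
V = [[1.0, -1.0], [0.5, 2.0]]  # one kernel shared by all locations

H = shared_kernel_layer(X, V)
H_from_shifted_input = shared_kernel_layer(shift_right(X, 2), V)

# Shifting the input by two columns shifts the response by two columns
print(H_from_shifted_input == shift_right(H, 2))
```

The equality holds here because the pattern stays inside the image after the shift. Near the boundary the correspondence breaks down, since part of the pattern would be cut off.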


Locality 


Now let’s invoke the second principle: locality. As motivated above, we believe that we
should not have to look very far away from location $(i, j)$ in order to glean relevant
information to assess what is going on at $[\mathbf{H}]_{i, j}$. This means that outside some range $|a| > \Delta$
or $|b| > \Delta$, we should set $[\mathbf{V}]_{a, b} = 0$. Equivalently, we can rewrite $[\mathbf{H}]_{i, j}$ as


$$[\mathbf{H}]_{i, j} = u + \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} [\mathbf{V}]_{a, b} [\mathbf{X}]_{i+a, j+b}.$$ (7.1.3)

This reduces the number of parameters from $4 \times 10^6$ to $4\Delta^2$, where $\Delta$ is typically smaller than
10. As such, we reduced the number of parameters by another four orders of magnitude.
Note that (7.1.3) is, in a nutshell, what is called a convolutional layer. Convolutional
neural networks (CNNs) are a special family of neural networks that contain convolutional
layers. In the deep learning research community, V is referred to as a convolution kernel,
a filter, or simply the layer’s weights that are learnable parameters.
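The three parameter counts discussed so far can be tabulated in a few lines. This is a back-of-the-envelope sketch rather than part of the original derivation. The function names are hypothetical, and `delta` stands for the kernel half-width from (7.1.3). Note that the exact count for offsets in [-delta, delta] is (2 * delta + 1)^2, which the text rounds to roughly 4 * delta^2.

```python
def params_fully_connected(n):
    """Fourth-order tensor mapping an n x n image to an n x n hidden map."""
    return n ** 4


def params_translation_invariant(n):
    """One shared kernel with offsets a, b in (-n, n): about (2n)^2 entries."""
    return (2 * n) ** 2


def params_local(delta):
    """Offsets a, b in [-delta, delta]: exactly (2 * delta + 1)^2 entries."""
    return (2 * delta + 1) ** 2


n, delta = 1000, 10
print(params_fully_connected(n))        # 1000000000000
print(params_translation_invariant(n))  # 4000000
print(params_local(delta))              # 441
```

With `delta = 10` the kernel has 441 entries, which matches the order of magnitude quoted in the text for a single channel.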


Whereas previously we might have required billions of parameters to represent just a single
layer in an image-processing network, we now typically need just a few hundred, without
altering the dimensionality of either the inputs or the hidden representations. The price
paid for this drastic reduction in parameters is that our features are now translation invariant
and that our layer can only incorporate local information when determining the value of
each hidden activation. All learning depends on imposing inductive bias. When that bias
agrees with reality, we get sample-efficient models that generalize well to unseen data. But
of course, if those biases do not agree with reality, e.g., if images turned out not to be
translation invariant, our models might struggle even to fit our training data.


This dramatic reduction in parameters brings us to our last desideratum, namely that deeper 
layers should represent larger and more complex aspects of an image. This can be achieved 
by interleaving nonlinearities and convolutional layers repeatedly. 


7.1.3 Convolutions 


Let’s briefly review why (7.1.3) is called a convolution. In mathematics, the convolution
between two functions (Rudin, 1973), say $f, g: \mathbb{R}^d \to \mathbb{R}$, is defined as


$$(f * g)(\mathbf{x}) = \int f(\mathbf{z}) g(\mathbf{x} - \mathbf{z}) \, d\mathbf{z}.$$ (7.1.4)


That is, we measure the overlap between f and g when one function is “flipped” and shifted 
by x. Whenever we have discrete objects, the integral turns into a sum. For instance, for 
vectors from the set of square-summable infinite-dimensional vectors with index running 
over Z we obtain the following definition: 


$$(f * g)(i) = \sum_a f(a) g(i - a).$$ (7.1.5)


For two-dimensional tensors, we have a corresponding sum with indices $(a, b)$ for $f$ and
$(i-a, j-b)$ for $g$, respectively:

$$(f * g)(i, j) = \sum_a \sum_b f(a, b) g(i - a, j - b).$$ (7.1.6)


This looks similar to (7.1.3), with one major difference. Rather than using $(i+a, j+b)$,
we are using the difference instead. Note, though, that this distinction is mostly cosmetic
since we can always match the notation between (7.1.3) and (7.1.6). Our original definition
in (7.1.3) more properly describes a cross-correlation. We will come back to this in the
following section.
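A small one-dimensional example makes the relation between the two operations concrete. The sketch below implements (7.1.5) for finite lists, under the assumption that entries outside a list are zero, and compares it with a cross-correlation that uses `i + a` instead of `i - a`. The function names are made up for this illustration.

```python
def conv1d(f, g):
    """Discrete convolution (f * g)(i) = sum_a f(a) g(i - a) for finite lists.

    Entries outside a list's index range are treated as zero.
    """
    n = len(f) + len(g) - 1
    return [sum(f[a] * g[i - a] for a in range(len(f)) if 0 <= i - a < len(g))
            for i in range(n)]


def xcorr1d(f, g):
    """Cross-correlation c(i) = sum_a f(a) g(i + a) over all overlapping shifts."""
    shifts = range(-(len(f) - 1), len(g))
    return [sum(f[a] * g[i + a] for a in range(len(f)) if 0 <= i + a < len(g))
            for i in shifts]


f = [1, 2, 3]
g = [0, 1, 0.5]

print(conv1d(f, g))                          # [0, 1, 2.5, 4, 1.5]
print(conv1d(f, g) == conv1d(g, f))          # convolution is symmetric
print(xcorr1d(f[::-1], g) == conv1d(f, g))   # flipping f turns one into the other
```

Reversing one argument converts a cross-correlation into a convolution, which is why the difference between the two is a matter of notation once the kernel is free to be learned.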


7.1.4 Channels 


Returning to our Waldo detector, let’s see what this looks like. The convolutional layer picks 
windows of a given size and weighs intensities according to the filter V, as demonstrated 
in Fig. 7.1.2. We might aim to learn a model so that wherever the “waldoness” is highest, 
we should find a peak in the hidden layer representations. 


Fig. 7.1.2 Detect Waldo (image courtesy of William Murphy (Infomatique)).


There is just one problem with this approach. So far, we blissfully ignored that images
consist of three channels: red, green, and blue. In sum, images are not two-dimensional
objects but rather third-order tensors, characterized by a height, width, and channel, e.g.,
with shape $1024 \times 1024 \times 3$ pixels. While the first two of these axes concern spatial
relationships, the third can be regarded as assigning a multidimensional representation to each pixel
location. We thus index X as $[\mathsf{X}]_{i, j, k}$. The convolutional filter has to adapt accordingly.
Instead of $[\mathbf{V}]_{a, b}$, we now have $[\mathsf{V}]_{a, b, c}$.


Moreover, just as our input consists of a third-order tensor, it turns out to be a good idea 
to similarly formulate our hidden representations as third-order tensors H. In other words, 
rather than just having a single hidden representation corresponding to each spatial location, 
we want an entire vector of hidden representations corresponding to each spatial location. 
We could think of the hidden representations as comprising a number of two-dimensional 
grids stacked on top of each other. As in the inputs, these are sometimes called channels. 
They are also sometimes called feature maps, as each provides a spatialized set of learned 
features for the subsequent layer. Intuitively, you might imagine that at lower layers that are 
closer to inputs, some channels could become specialized to recognize edges while others 
could recognize textures. 


To support multiple channels in both inputs (X) and hidden representations (H), we can add
a fourth coordinate to V: $[\mathsf{V}]_{a, b, c, d}$. Putting everything together we have:


$$[\mathsf{H}]_{i, j, d} = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} \sum_c [\mathsf{V}]_{a, b, c, d} [\mathsf{X}]_{i+a, j+b, c},$$ (7.1.7)


where d indexes the output channels in the hidden representations H. The subsequent con- 
volutional layer will go on to take a third-order tensor, H, as input. We take (7.1.7), because 
of its generality, as the definition of a convolutional layer for multiple channels, where V 
is a kernel or filter of the layer. 
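The multi-channel definition can be transcribed almost literally into nested loops. The following is a deliberately slow sketch on nested Python lists and not how any framework implements it. It assumes that pixels outside the image count as zero and that the kernel is stored with its offsets shifted by `delta` so that list indices are non-negative. The function name is hypothetical.

```python
def conv_multichannel(X, V, delta):
    """H[i][j][d] = sum over a, b in [-delta, delta] and c of
    V[a + delta][b + delta][c][d] * X[i + a][j + b][c].

    X is indexed (height, width, input channel). Pixels outside X count as zero.
    """
    height, width, c_in = len(X), len(X[0]), len(X[0][0])
    c_out = len(V[0][0][0])
    H = [[[0.0] * c_out for _ in range(width)] for _ in range(height)]
    for i in range(height):
        for j in range(width):
            for d in range(c_out):
                total = 0.0
                for a in range(-delta, delta + 1):
                    for b in range(-delta, delta + 1):
                        if 0 <= i + a < height and 0 <= j + b < width:
                            for c in range(c_in):
                                total += (V[a + delta][b + delta][c][d]
                                          * X[i + a][j + b][c])
                H[i][j][d] = total
    return H


# A 2 x 2 image with 3 input channels
X = [[[1, 2, 3], [4, 5, 6]],
     [[7, 8, 9], [0, 1, 0]]]
# delta = 0: a 1 x 1 kernel mapping 3 input channels to 2 output channels.
# Output channel 0 sums the inputs; channel 1 is first minus third input.
V = [[[[1, 1], [1, 0], [1, -1]]]]

H = conv_multichannel(X, V, delta=0)
print(H)  # [[[6.0, -2.0], [15.0, -2.0]], [[24.0, -2.0], [1.0, 0.0]]]
```

With `delta = 0` every pixel is transformed by the same small linear map across channels, independently of its neighbors.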


There are still many operations that we need to address. For instance, we need to figure out 
how to combine all the hidden representations to a single output, e.g., whether there is a 
Waldo anywhere in the image. We also need to decide how to compute things efficiently, 
how to combine multiple layers, appropriate activation functions, and how to make reason- 
able design choices to yield networks that are effective in practice. We turn to these issues 
in the remainder of the chapter. 


7.1.5 Summary and Discussion 


In this section we derived the structure of convolutional neural networks from first prin- 
ciples. While it is unclear whether this was the route taken to the invention of CNNs, it 
is satisfying to know that they are the right choice when applying reasonable principles 
to how image processing and computer vision algorithms should operate, at least at lower 
levels. In particular, translation invariance in images implies that all patches of an image 
will be treated in the same manner. Locality means that only a small neighborhood of pix- 
els will be used to compute the corresponding hidden representations. Some of the earliest 
references to CNNs are in the form of the Neocognitron (Fukushima, 1982). 


A second principle that we encountered in our reasoning is how to reduce the number of 
parameters in a function class without limiting its expressive power, at least, whenever 
certain assumptions on the model hold. We saw a dramatic reduction of complexity as a 
result of this restriction, turning computationally and statistically infeasible problems into 
tractable models. 


Adding channels allowed us to bring back some of the complexity that was lost due to the re- 
strictions imposed on the convolutional kernel by locality and translation invariance. Note 
that it is quite natural to add channels other than just red, green, and blue. Many satellite 
images, in particular for agriculture and meteorology, have tens to hundreds of channels, 
generating hyperspectral images instead. They report data on many different wavelengths. 
In the following we will see how to use convolutions effectively to manipulate the dimen- 
sionality of the images they operate on, how to move from location-based to channel-based 
representations, and how to deal with large numbers of categories efficiently. 


7.1.6 Exercises 


1. Assume that the size of the convolution kernel is $\Delta = 0$. Show that in this case the
convolution kernel implements an MLP independently for each set of channels. This
leads to the Network in Network architectures (Lin et al., 2013).




2. Audio data is often represented as a one-dimensional sequence.
    1. When might you want to impose locality and translation invariance for audio?
    2. Derive the convolution operations for audio.
    3. Can you treat audio using the same tools as computer vision? Hint: use the spectrogram.


3. Why might translation invariance not be a good idea after all? Give an example. 


4. Do you think that convolutional layers might also be applicable for text data? Which 
problems might you encounter with language? 


5. What happens with convolutions when an object is at the boundary of an image? 


6. Prove that the convolution is symmetric, i.e., $f * g = g * f$.




7.2 Convolutions for Images 


Now that we understand how convolutional layers work in theory, we are ready to see how 
they work in practice. Building on our motivation of convolutional neural networks as 
efficient architectures for exploring structure in image data, we stick with images as our 
running example. 


import torch 
from torch import nn 
from d2l import torch as d2l


7.2.1 The Cross-Correlation Operation 


Recall that strictly speaking, convolutional layers are a misnomer, since the operations they 
express are more accurately described as cross-correlations. Based on our descriptions of 
convolutional layers in Section 7.1, in such a layer, an input tensor and a kernel tensor are 
combined to produce an output tensor through a cross-correlation operation. 


Let’s ignore channels for now and see how this works with two-dimensional data and hidden 
representations. In Fig. 7.2.1, the input is a two-dimensional tensor with a height of 3 and 
width of 3. We mark the shape of the tensor as 3 x 3 or (3, 3). The height and width of the 
kernel are both 2. The shape of the kernel window (or convolution window) is given by the 
height and width of the kernel (here it is 2 x 2). 


Fig. 7.2.1 Two-dimensional cross-correlation operation. The shaded portions are the first output
element as well as the input and kernel tensor elements used for the output computation:
$0 \times 0 + 1 \times 1 + 3 \times 2 + 4 \times 3 = 19$.


In the two-dimensional cross-correlation operation, we begin with the convolution window
positioned at the upper-left corner of the input tensor and slide it across the input tensor,
both from left to right and top to bottom. When the convolution window slides to a certain
position, the input subtensor contained in that window and the kernel tensor are multiplied
elementwise and the resulting tensor is summed up yielding a single scalar value. This
result gives the value of the output tensor at the corresponding location. Here, the output
tensor has a height of 2 and width of 2 and the four elements are derived from the
two-dimensional cross-correlation operation:
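$$\begin{aligned}
0 \times 0 + 1 \times 1 + 3 \times 2 + 4 \times 3 &= 19, \\
1 \times 0 + 2 \times 1 + 4 \times 2 + 5 \times 3 &= 25, \\
3 \times 0 + 4 \times 1 + 6 \times 2 + 7 \times 3 &= 37, \\
4 \times 0 + 5 \times 1 + 7 \times 2 + 8 \times 3 &= 43.
\end{aligned}$$ (7.2.1)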


Note that along each axis, the output size is slightly smaller than the input size. Because
the kernel has width and height greater than 1, we can only properly compute the
cross-correlation for locations where the kernel fits wholly within the image. The output size is
therefore given by the input size $n_h \times n_w$ minus the size of the convolution kernel $k_h \times k_w$ via


$$(n_h - k_h + 1) \times (n_w - k_w + 1).$$ (7.2.2)
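The output-size formula is simple enough to wrap in a helper, which is handy for sanity-checking shapes before running any tensor code. The function below is a sketch with a hypothetical name. It assumes no padding and a stride of 1, matching the setting of this section.

```python
def corr2d_output_shape(input_shape, kernel_shape):
    """Output shape of a 2D cross-correlation without padding, stride 1."""
    n_h, n_w = input_shape
    k_h, k_w = kernel_shape
    if k_h > n_h or k_w > n_w:
        raise ValueError("kernel must fit inside the input")
    return (n_h - k_h + 1, n_w - k_w + 1)


print(corr2d_output_shape((3, 3), (2, 2)))  # (2, 2)
print(corr2d_output_shape((6, 8), (1, 2)))  # (6, 7)
print(corr2d_output_shape((8, 6), (1, 2)))  # (8, 5)
```

A kernel of size 1 along an axis leaves that axis unchanged, while each additional kernel row or column removes one output row or column.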


This is the case since we need enough space to “shift” the convolution kernel across the 
image. Later we will see how to keep the size unchanged by padding the image with zeros 
around its boundary so that there is enough space to shift the kernel. Next, we implement 
this process in the corr2d function, which accepts an input tensor X and a kernel tensor K 
and returns an output tensor Y. 


def corr2d(X, K):  #@save
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y


We can construct the input tensor X and the kernel tensor K from Fig. 7.2.1 to validate 
the output of the above implementation of the two-dimensional cross-correlation opera- 
tion. 


X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)


tensor([[19., 25.],
        [37., 43.]])


7.2.2 Convolutional Layers 


A convolutional layer cross-correlates the input and kernel and adds a scalar bias to produce 
an output. The two parameters of a convolutional layer are the kernel and the scalar bias. 
When training models based on convolutional layers, we typically initialize the kernels 
randomly, just as we would with a fully connected layer. 


We are now ready to implement a two-dimensional convolutional layer based on the corr2d 
function defined above. In the __init__ constructor method, we declare weight and bias 
as the two model parameters. The forward propagation method calls the corr2d function 
and adds the bias. 


class Conv2D(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias


In $h \times w$ convolution or an $h \times w$ convolution kernel, the height and width of the convolution
kernel are $h$ and $w$, respectively. We also refer to a convolutional layer with an $h \times w$
convolution kernel simply as an $h \times w$ convolutional layer.


7.2.3 Object Edge Detection in Images 


Let’s take a moment to parse a simple application of a convolutional layer: detecting the
edge of an object in an image by finding the location of the pixel change. First, we construct
an “image” of $6 \times 8$ pixels. The middle four columns are black (0) and the rest are white
(1).


X = torch.ones((6, 8))
X[:, 2:6] = 0
X


tensor ([[1., 1., @., Q. 
[1., 1., @. 


Ss 

a 

s9 
; ae 
eee 
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(continues on next page) 
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(continued from previous page) 


[1., 1., @, @, @, O., 1., 1.1, 
is Wine Oss, Oey Oxy Oy, Te, Ded; 
[1., 1., @, @, @, @, 1., 1.7) 


Next, we construct a kernel K with a height of 1 and a width of 2. When we perform
the cross-correlation operation with the input, if the horizontally adjacent elements are
the same, the output is 0. Otherwise, the output is nonzero. Note that this kernel is a
special case of a finite difference operator. At location $(i, j)$ it computes $x_{i, j} - x_{(i+1), j}$,
i.e., it computes the difference between the values of horizontally adjacent pixels. This is
a discrete approximation of the first derivative in the horizontal direction. After all, for
a function $f(i, j)$ its derivative is $-\partial_i f(i, j) = \lim_{\epsilon \to 0} \frac{f(i, j) - f(i+\epsilon, j)}{\epsilon}$. Let’s see how this
works in practice.


K = torch.tensor([[1.0, -1.0]]) 


We are ready to perform the cross-correlation operation with arguments X (our input) and
K (our kernel). As you can see, we detect 1 for the edge from white to black and -1 for the
edge from black to white. All other outputs take value 0.


Y = corr2d(X, K)
Y


tensor([L 0., 1., ©., ©., @., -1., 90.], 
[@., 1., @, ©, 0., -1., 0.], 
[@., 1., @, 0a ®., -1., 0], 
CO, 1., @, ©, 0., -1., @.], 
[@., 1., @, ©., 0., -1., 0.], 
[@., 1., @, ®., ®., -1., 0.11) 


We can now apply the kernel to the transposed image. As expected, it vanishes. The kernel 
K only detects vertical edges. 


corr2d(X.t(), K)


tensor(L[[@., ©., ©., ©., 9.], 
[0., 0., ©., ©., 0], 
[0., 0., ©., ©., 0], 
[0., @., ©., ©., 0], 
o., @., ©., ©., 0], 
[0., @., @., Ori 0]; 
[0., @., ©., ©., 0], 
Eo., 0., ©., @, 0.11) 
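The same reasoning suggests how to pick up horizontal edges instead: use a kernel that takes differences down a column rather than along a row. The sketch below illustrates this with a pure-Python stand-in for the cross-correlation, here called `corr2d_lists` (a hypothetical name), operating on nested lists so that it runs without any tensor library. The image with a horizontal black band is an assumption made up for this example.

```python
def corr2d_lists(X, K):
    """2D cross-correlation on nested lists (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]


# A 6 x 6 image with a horizontal band: rows 2 and 3 are black (0)
X = [[1.0] * 6 for _ in range(6)]
for i in range(2, 4):
    X[i] = [0.0] * 6

K_row = [[1.0, -1.0]]    # 1 x 2 kernel: differences along a row
K_col = [[1.0], [-1.0]]  # 2 x 1 kernel: differences down a column

Y_row = corr2d_lists(X, K_row)  # no vertical edges, so all zeros
Y_col = corr2d_lists(X, K_col)  # responds where white meets black

print([row[0] for row in Y_col])  # [0.0, 1.0, 0.0, -1.0, 0.0]
```

The row-difference kernel sees nothing in this image, while the column-difference kernel reports 1 at the white-to-black transition and -1 at the black-to-white transition.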


7.2.4 Learning a Kernel


Designing an edge detector by finite differences [1, -1] is neat if we know this is precisely
what we are looking for. However, as we look at larger kernels, and consider successive
layers of convolutions, it might be impossible to specify precisely what each filter should
be doing manually.


Now let’s see whether we can learn the kernel that generated Y from X by looking at the
input-output pairs only. We first construct a convolutional layer and initialize its kernel as
a random tensor. Next, in each iteration, we will use the squared error to compare Y with the
output of the convolutional layer. We can then calculate the gradient to update the kernel.
For the sake of simplicity, in the following we use the built-in class for two-dimensional
convolutional layers and ignore the bias.


# Construct a two-dimensional convolutional layer with 1 output channel and a
# kernel of shape (1, 2). For the sake of simplicity, we ignore the bias here
conv2d = nn.LazyConv2d(1, kernel_size=(1, 2), bias=False)

# The two-dimensional convolutional layer uses four-dimensional input and
# output in the format of (example, channel, height, width), where the batch
# size (number of examples in the batch) and the number of channels are both 1
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
lr = 3e-2  # Learning rate

for i in range(10):
    Y_hat = conv2d(X)
    l = (Y_hat - Y) ** 2
    conv2d.zero_grad()
    l.sum().backward()
    # Update the kernel
    conv2d.weight.data[:] -= lr * conv2d.weight.grad
    if (i + 1) % 2 == 0:
        print(f'epoch {i + 1}, loss {l.sum():.3f}')


epoch 2, loss 16.481 
epoch 4, loss 5.069 
epoch 6, loss 1.794 
epoch 8, loss 0.688 
epoch 10, loss 0.274


Note that the error has dropped to a small value after 10 iterations. Now we will take a look 
at the kernel tensor we learned. 


conv2d.weight.data.reshape((1, 2)) 


tensor([[ 1.0398, -0.9328]])


Indeed, the learned kernel tensor is remarkably close to the kernel tensor K we defined 
earlier. 
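To make the update rule explicit, the same experiment can be written without automatic differentiation. The sketch below is an illustrative variant and not the code used above. It works on plain Python lists, derives the gradient of the summed squared error by hand, starts from a zero kernel instead of a random one, and runs 100 steps rather than 10. All of these choices, as well as the helper name `corr_rows`, are assumptions made for this example.

```python
def corr_rows(X, w):
    """Cross-correlate each row of X with the 1 x 2 kernel [w[0], w[1]]."""
    return [[w[0] * row[j] + w[1] * row[j + 1] for j in range(len(row) - 1)]
            for row in X]


# Same 6 x 8 image as in the text: middle four columns black, rest white
X = [[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0] for _ in range(6)]
Y = corr_rows(X, [1.0, -1.0])  # targets produced by the edge-detection kernel

w = [0.0, 0.0]  # assumption: start from zeros instead of a random kernel
lr = 3e-2
losses = []
for step in range(100):
    Y_hat = corr_rows(X, w)
    grad = [0.0, 0.0]
    loss = 0.0
    for row, y_row, y_hat_row in zip(X, Y, Y_hat):
        for j, (y, y_hat) in enumerate(zip(y_row, y_hat_row)):
            err = y_hat - y
            loss += err ** 2
            # d(err^2)/dw[k] = 2 * err * x[j + k]
            grad[0] += 2 * err * row[j]
            grad[1] += 2 * err * row[j + 1]
    losses.append(loss)
    # Plain gradient descent step on the two kernel entries
    w = [w[0] - lr * grad[0], w[1] - lr * grad[1]]

print(w)
```

Because the loss is quadratic in the two kernel entries, gradient descent with this learning rate contracts the error at every step, and the kernel approaches [1, -1].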
7.2.5 Cross-Correlation and Convolution


Recall our observation from Section 7.1 of the correspondence between the cross-correlation
and convolution operations. Here let’s continue to consider two-dimensional convolutional
layers. What if such layers perform strict convolution operations as defined in (7.1.6)
instead of cross-correlations? In order to obtain the output of the strict convolution operation,
we only need to flip the two-dimensional kernel tensor both horizontally and vertically, and
then perform the cross-correlation operation with the input tensor.


It is noteworthy that since kernels are learned from data in deep learning, the outputs of
convolutional layers remain unaffected regardless of whether such layers perform the strict
convolution operations or the cross-correlation operations.


To illustrate this, suppose that a convolutional layer performs cross-correlation and learns 
the kernel in Fig. 7.2.1, which is here denoted as the matrix K. Assuming that other con- 
ditions remain unchanged, when this layer instead performs strict convolution, the learned 
kernel K’ will be the same as K after K’ is flipped both horizontally and vertically. That 
is to say, when the convolutional layer performs strict convolution for the input in Fig. 
7.2.1 and K’, the same output in Fig. 7.2.1 (cross-correlation of the input and K) will be 
obtained. 
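The flipping argument can be verified numerically on the input and kernel of Fig. 7.2.1. The following sketch uses nested Python lists. The helper names are hypothetical, and the strict convolution is restricted, as an assumption, to output positions where the kernel fits entirely inside the input so that both operations produce outputs of the same shape.

```python
def corr2d_lists(X, K):
    """2D cross-correlation on nested lists (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    return [[sum(K[a][b] * X[i + a][j + b] for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]


def conv2d_strict(X, K):
    """Strict convolution: the kernel index runs against the input index."""
    h, w = len(K), len(K[0])
    return [[sum(K[a][b] * X[i + h - 1 - a][j + w - 1 - b]
                 for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]


def flip2d(K):
    """Flip a kernel both vertically and horizontally."""
    return [row[::-1] for row in K[::-1]]


X = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
K = [[0.0, 1.0], [2.0, 3.0]]

print(corr2d_lists(X, K))           # [[19.0, 25.0], [37.0, 43.0]]
print(conv2d_strict(X, flip2d(K)))  # same values: flipping undoes the difference
print(conv2d_strict(X, K))          # different values without the flip
```

A layer that performed strict convolution would therefore simply learn the flipped kernel and produce the same outputs.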


In keeping with standard terminology in deep learning literature, we will continue to refer to
the cross-correlation operation as a convolution even though, strictly speaking, it is slightly
different. Furthermore, we use the term element to refer to an entry (or component) of any
tensor representing a layer representation or a convolution kernel.


7.2.6 Feature Map and Receptive Field 


As described in Section 7.1.4, the convolutional layer output in Fig. 7.2.1 is sometimes 
called a feature map, as it can be regarded as the learned representations (features) in the 
spatial dimensions (e.g., width and height) to the subsequent layer. In CNNs, for any el- 
ement x of some layer, its receptive field refers to all the elements (from all the previous 
layers) that may affect the calculation of x during the forward propagation. Note that the 
receptive field may be larger than the actual size of the input. 


Let’s continue to use Fig. 7.2.1 to explain the receptive field. Given the 2 x 2 convolution 
kernel, the receptive field of the shaded output element (of value 19) is the four elements 
in the shaded portion of the input. Now let’s denote the 2 x 2 output as Y and consider a 
deeper CNN with an additional 2 x 2 convolutional layer that takes Y as its input, outputting
a single element z. In this case, the receptive field of z on Y includes all the four elements 
of Y, while the receptive field on the input includes all the nine input elements. Thus, when 
any element in a feature map needs a larger receptive field to detect input features over a 
broader area, we can build a deeper network. 
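How quickly the receptive field grows with depth can be worked out with a short recurrence.
The sketch below is an illustration under simplifying assumptions (one spatial dimension, no
dilation, no padding effects); the function name receptive_field and the (kernel size, stride)
layer description are made up for this example.

```python
def receptive_field(layers):
    """Receptive field width on the input after a stack of convolutions.

    layers: list of (kernel_size, stride) pairs, first layer first.
    """
    rf = 1      # a single input element sees only itself
    jump = 1    # distance on the input between neighboring elements of the current layer
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(2, 1)]))          # one 2 x 2 layer: 2 elements per dimension
print(receptive_field([(2, 1), (2, 1)]))  # two 2 x 2 layers: 3 elements per dimension
print(receptive_field([(3, 2), (3, 1)]))  # a stride in the first layer widens the field
```

For the two stacked 2 x 2 layers discussed above the result is 3 per dimension, i.e., all nine
elements of the 3 x 3 input.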


Receptive fields derive their name from neurophysiology. A series of experiments on a 
range of animals using different stimuli (Hubel and Wiesel, 1959, Hubel and Wiesel, 1962, 
Hubel and Wiesel, 1968) explored the response of what is called the visual cortex to said
stimuli. By and large they found that lower levels respond to edges and related shapes.
Later on, Field (1987) illustrated this effect on natural images with what can only be called
convolutional kernels. We reprint a key figure in Fig. 7.2.2 to illustrate the striking
similarities.


Figure and caption taken from Field (1987): An example of coding with six different 
channels. (Left) Examples of the six types of sensor associated with each channel. (Right) 
Convolution of the image in (Middle) with the six sensors shown in (Left). The response 
of the individual sensors is determined by sampling these filtered images at a distance 
proportional to the size of the sensor (shown with dots). This diagram shows the response 
of only the even symmetric sensors. 


As it turns out, this relation even holds for the features computed by deeper layers of net- 
works trained on image classification tasks, as demonstrated in, for example, Kuzovkin et 
al. (2018). Suffice it to say, convolutions have proven to be an incredibly powerful tool for 
computer vision, both in biology and in code. As such, it is not surprising (in hindsight) 
that they heralded the recent success in deep learning. 


7.2.7 Summary 


The core computation required for a convolutional layer is a cross-correlation operation. 
We saw that a simple nested for-loop is all that is required to compute its value. If we 
have multiple input and multiple output channels, we are performing a matrix-matrix
operation between channels. As can be seen, the computation is straightforward and, most
importantly, highly local. This affords significant hardware optimization and many recent 
results in computer vision are only possible because of that. After all, it means that chip 
designers can invest in fast computation rather than memory when it comes to optimizing 
for convolutions. While this may not lead to optimal designs for other applications, it does 
open the door to ubiquitous and affordable computer vision.


In terms of convolutions themselves, they can be used for many purposes, for example
detecting edges and lines, blurring images, or sharpening them. Most importantly, it is 
not necessary that the statistician (or engineer) invents suitable filters. Instead, we can 
simply learn them from data. This replaces feature engineering heuristics by evidence- 
based statistics. Lastly, and quite delightfully, these filters are not just advantageous for 
building deep networks but they also correspond to receptive fields and feature maps in the 
brain. This gives us confidence that we are on the right track. 


7.2.8 Exercises 
1. Construct an image X with diagonal edges.
   1. What happens if you apply the kernel K in this section to it?
   2. What happens if you transpose X?
   3. What happens if you transpose K?
2. Design some kernels manually.
   1. Given a directional vector $\mathbf{v} = (v_1, v_2)$, derive an edge-detection kernel that
      detects edges orthogonal to $\mathbf{v}$, i.e., edges in the direction $(v_2, -v_1)$.
   2. Derive a finite difference operator for the second derivative. What is the minimum
      size of the convolutional kernel associated with it? Which structures in images
      respond most strongly to it?
   3. How would you design a blur kernel? Why might you want to use such a kernel?
   4. What is the minimum size of a kernel to obtain a derivative of order $d$?
3. When you try to automatically find the gradient for the Conv2D class we created, what
   kind of error message do you see?
4. How do you represent a cross-correlation operation as a matrix multiplication by
   changing the input and kernel tensors?


Discussions.


7.3 Padding and Stride 


Recall the example of a convolution in Fig. 7.2.1. The input had both a height and width of
3 and the convolution kernel had both a height and width of 2, yielding an output
representation with dimension $2 \times 2$. Assuming that the input shape is $n_h \times n_w$
and the convolution kernel shape is $k_h \times k_w$, the output shape will be
$(n_h - k_h + 1) \times (n_w - k_w + 1)$: we can only shift the convolution kernel so far until
it runs out of pixels to apply the convolution to.


In the following we will explore a number of techniques, including padding and strided
convolutions, that offer more control over the size of the output. As motivation, note that 
since kernels generally have width and height greater than 1, after applying many successive 
convolutions, we tend to wind up with outputs that are considerably smaller than our input. 
If we start with a 240 x 240 pixel image, ten layers of 5 x 5 convolutions reduce the image 
to 200 x 200 pixels, slicing off 30% of the image and with it obliterating any interesting 
information on the boundaries of the original image. Padding is the most popular tool for 
handling this issue. In other cases, we may want to reduce the dimensionality drastically, 
e.g., if we find the original input resolution to be unwieldy. Strided convolutions are a 
popular technique that can help in these instances. 
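The shrinkage in this example is easy to verify by applying the output size formula once per
layer. The snippet is a small sketch in plain Python; the helper name shrink is made up for
illustration.

```python
def shrink(size, kernel, num_layers):
    """Side length after num_layers unpadded convolutions with stride 1."""
    for _ in range(num_layers):
        size = size - kernel + 1
    return size

side = shrink(240, kernel=5, num_layers=10)
kept = side ** 2 / 240 ** 2   # fraction of the original pixel count that survives

print(side)            # side length of the final output
print(round(kept, 3))  # roughly 0.69, i.e., about 30% of the image is sliced off
```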


import torch 
from torch import nn 


7.3.1 Padding 


As described above, one tricky issue when applying convolutional layers is that we tend 
to lose pixels on the perimeter of our image. Consider Fig. 7.3.1 that depicts the pixel 
utilization as a function of the convolution kernel size and the position within the image. 
The pixels in the corners are hardly used at all.


Pixel utilization for convolutions of size 1 x 1, 2 x 2, and 3 x 3 respectively.


Since we typically use small kernels, for any given convolution we might only lose a few 
pixels but this can add up as we apply many successive convolutional layers. One straight- 
forward solution to this problem is to add extra pixels of filler around the boundary of our 
input image, thus increasing the effective size of the image. Typically, we set the values of 
the extra pixels to zero. In Fig. 7.3.2, we pad a $3 \times 3$ input, increasing its size to
$5 \times 5$. The corresponding output then increases to a $4 \times 4$ matrix. The shaded
portions are the first output element as well as the input and kernel tensor elements used
for the output computation: $0 \times 0 + 0 \times 1 + 0 \times 2 + 0 \times 3 = 0$.


Two-dimensional cross-correlation with padding.


In general, if we add a total of $p_h$ rows of padding (roughly half on top and half on bottom)
and a total of $p_w$ columns of padding (roughly half on the left and half on the right), the
output shape will be


$$(n_h - k_h + p_h + 1) \times (n_w - k_w + p_w + 1). \tag{7.3.1}$$


This means that the height and width of the output will increase by $p_h$ and $p_w$,
respectively.


In many cases, we will want to set $p_h = k_h - 1$ and $p_w = k_w - 1$ to give the input and
output the same height and width. This will make it easier to predict the output shape of
each layer when constructing the network. Assuming that $k_h$ is odd here, we will pad
$p_h/2$ rows on both sides of the height. If $k_h$ is even, one possibility is to pad
$\lceil p_h/2 \rceil$ rows on the top of the input and $\lfloor p_h/2 \rfloor$ rows on the
bottom. We will pad both sides of the width in the same way.
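The padding split described here can be tried out directly. The sketch below is one possible
realization, not the book's code: same_pad is a hypothetical helper, and it relies on
torch.nn.functional.pad, which takes the padding amounts for the last dimension first, in the
order (left, right, top, bottom), and fills with zeros by default.

```python
import math
import torch
import torch.nn.functional as F

def same_pad(X, k_h, k_w):
    """Zero-pad a (batch, channel, height, width) tensor so that a k_h x k_w
    cross-correlation with stride 1 preserves height and width."""
    p_h, p_w = k_h - 1, k_w - 1
    top, bottom = math.ceil(p_h / 2), math.floor(p_h / 2)
    left, right = math.ceil(p_w / 2), math.floor(p_w / 2)
    return F.pad(X, (left, right, top, bottom))

X = torch.rand(1, 1, 8, 8)
for k_h, k_w in [(3, 3), (2, 2), (4, 5)]:
    K = torch.rand(1, 1, k_h, k_w)
    Y = F.conv2d(same_pad(X, k_h, k_w), K)
    print((k_h, k_w), tuple(Y.shape[2:]))   # height and width stay at 8 in every case
```

For odd kernel sizes the split is symmetric; for even sizes the extra row or column goes to the
top or left, which is only one of several reasonable conventions.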


CNNs commonly use convolution kernels with odd height and width values, such as 1, 3, 
5, or 7. Choosing odd kernel sizes has the benefit that we can preserve the dimensionality 
while padding with the same number of rows on top and bottom, and the same number of 
columns on left and right. 


Moreover, this practice of using odd kernels and padding to precisely preserve dimension- 
ality offers a clerical benefit. For any two-dimensional tensor X, when the kernel’s size is 
odd and the number of padding rows and columns on all sides are the same, thereby pro- 
ducing an output with the same height and width as the input, we know that the output Y[i, 
j] is calculated by cross-correlation of the input and convolution kernel with the window 
centered on X[i, j]. 
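One way to see this centering is to use a kernel whose only nonzero weight sits in the middle.
The following sketch (an illustration with made-up variable names) uses
torch.nn.functional.conv2d, which computes a cross-correlation and pads the given number of
zeros on every side.

```python
import torch
import torch.nn.functional as F

X = torch.rand(1, 1, 8, 8)
K = torch.zeros(1, 1, 3, 3)
K[0, 0, 1, 1] = 1.0             # only the center weight is nonzero

# 1 row and column of zeros on every side keeps the 8 x 8 shape
Y = F.conv2d(X, K, padding=1)

# Since the window for Y[i, j] is centered on X[i, j], the output reproduces the input
print(torch.allclose(Y, X))
```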


In the following example, we create a two-dimensional convolutional layer with a height 
and width of 3 and apply 1 pixel of padding on all sides. Given an input with a height and 
width of 8, we find that the height and width of the output is also 8. 


# We define a helper function to calculate convolutions. It initializes the
# convolutional layer weights and performs corresponding dimensionality
# elevations and reductions on the input and output
def comp_conv2d(conv2d, X):
    # (1, 1) indicates that batch size and the number of channels are both 1
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    # Strip the first two dimensions: examples and channels
    return Y.reshape(Y.shape[2:])


# 1 row and column is padded on either side, so a total of 2 rows or columns
# are added
conv2d = nn.LazyConv2d(1, kernel_size=3, padding=1)
X = torch.rand(size=(8, 8))
comp_conv2d(conv2d, X).shape


torch.Size([8, 8])


When the height and width of the convolution kernel are different, we can make the output
and input have the same height and width by setting different padding numbers for height
and width.


# We use a convolution kernel with height 5 and width 3. The padding on either 
# side of the height and width are 2 and 1, respectively 

conv2d = nn.LazyConv2d(1, kernel_size=(5, 3), padding=(2, 1)) 
comp_conv2d(conv2d, X).shape 


torch.Size([8, 8]) 


7.3.2 Stride 


When computing the cross-correlation, we start with the convolution window at the upper- 
left corner of the input tensor, and then slide it over all locations both down and to the 
right. In the previous examples, we defaulted to sliding one element at a time. However, 
sometimes, either for computational efficiency or because we wish to downsample, we 
move our window more than one element at a time, skipping the intermediate locations. 
This is particularly useful if the convolution kernel is large since it captures a large area of 
the underlying image. 


We refer to the number of rows and columns traversed per slide as stride. So far, we have 
used strides of 1, both for height and width. Sometimes, we may want to use a larger stride. 
Fig. 7.3.3 shows a two-dimensional cross-correlation operation with a stride of 3 vertically 
and 2 horizontally. The shaded portions are the output elements as well as the input and 
kernel tensor elements used for the output computation: $0 \times 0 + 0 \times 1 + 1 \times 2 + 2 \times 3 = 8$,
$0 \times 0 + 6 \times 1 + 0 \times 2 + 0 \times 3 = 6$. We can see that when the second element of the first column is
generated, the convolution window slides down three rows. The convolution window slides 
two columns to the right when the second element of the first row is generated. When the 
convolution window continues to slide two columns to the right on the input, there is no 
output because the input element cannot fill the window (unless we add another column of 
padding). 


Cross-correlation with strides of 3 and 2 for height and width, respectively.


In general, when the stride for the height is $s_h$ and the stride for the width is $s_w$, the
output shape is


$$\lfloor (n_h - k_h + p_h + s_h)/s_h \rfloor \times \lfloor (n_w - k_w + p_w + s_w)/s_w \rfloor. \tag{7.3.2}$$


If we set $p_h = k_h - 1$ and $p_w = k_w - 1$, then the output shape can be simplified to
$\lfloor (n_h + s_h - 1)/s_h \rfloor \times \lfloor (n_w + s_w - 1)/s_w \rfloor$. Going a step
further, if the input height and width are divisible by the strides on the height and width,
then the output shape will be $(n_h/s_h) \times (n_w/s_w)$.
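The output shape formula translates directly into integer arithmetic. The helper below is a
sketch with a made-up name, conv_output_shape; note that its argument p is the total padding
per dimension as used in the formula, whereas the padding argument of the framework's
convolution layers counts rows or columns per side, so a per-side padding of 1 corresponds to
a total of 2.

```python
def conv_output_shape(n, k, p=(0, 0), s=(1, 1)):
    """Output (height, width) of a cross-correlation.

    n: input shape, k: kernel shape, p: TOTAL padding per dimension, s: stride.
    """
    return tuple((n_i - k_i + p_i + s_i) // s_i
                 for n_i, k_i, p_i, s_i in zip(n, k, p, s))

print(conv_output_shape((3, 3), (2, 2)))                      # no padding, stride 1
print(conv_output_shape((8, 8), (3, 3), p=(2, 2)))            # padding 1 per side
print(conv_output_shape((8, 8), (5, 3), p=(4, 2)))            # padding (2, 1) per side
print(conv_output_shape((8, 8), (3, 3), p=(2, 2), s=(2, 2)))  # stride 2 halves the size
```

Integer floor division implements the floor in the formula, so the same function covers the
unpadded, padded, and strided cases of this section.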


Below, we set the strides on both the height and width to 2, thus halving the input height 
and width. 


conv2d = nn.LazyConv2d(1, kernel_size=3, padding=1, stride=2) 
comp_conv2d(conv2d, X).shape 


torch.Size([4, 4]) 


Let’s look at a slightly more complicated example. 


conv2d = nn.LazyConv2d(1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))
comp_conv2d(conv2d, X).shape 


torch.Size([2, 2]) 


7.3.3 Summary and Discussion 


Padding can increase the height and width of the output. This is often used to give the 
output the same height and width as the input to avoid undesirable shrinkage of the output. 
Moreover, it makes pixel utilization more uniform, since pixels near the border contribute
to more output elements. Typically we pick symmetric padding on both sides of the input height
and width. In this case we refer to $(p_h, p_w)$ padding. Most commonly we set $p_h = p_w$,
in which case we simply state that we choose padding $p$.


A similar convention applies to strides. When the stride for the height $s_h$ and the stride
for the width $s_w$ match, we simply talk about stride $s$. The stride can reduce the resolution
of the output, for example reducing the height and width of the output to only $1/n$ of the
height and width of the input for $n > 1$. By default, the padding is 0 and the stride is 1.


So far all padding that we discussed simply extended images with zeros. This has signif- 
icant computational benefit since it is trivial to accomplish. Moreover, operators can be 
engineered to take advantage of this padding implicitly without the need to allocate addi- 
tional memory. At the same time, it allows CNNs to encode implicit position information 
within an image, simply by learning where the “whitespace” is. There are many alternatives 
to zero-padding. Alsallakh et al. (2020) provided an extensive overview of those (albeit 
without a clear case for when to use nonzero paddings unless artifacts occur). 


7.3.4 Exercises 


1. Given the final code example in this section with kernel size (3,5), padding (0, 1), and 
stride (3, 4), calculate the output shape to check if it is consistent with the experimental 
result. 
2. For audio signals, what does a stride of 2 correspond to?


3. Implement mirror padding, i.e., padding where the border values are simply mirrored 
to extend tensors. 


4. What are the computational benefits of a stride larger than 1? 
5. What might be statistical benefits of a stride larger than 1? 


6. How would you implement a stride of 5? What does it correspond to? When would this 
be useful? 


Discussions.


7.4 Multiple Input and Multiple Output Channels 


While we described the multiple channels that comprise each image (e.g., color images 
have the standard RGB channels to indicate the amount of red, green and blue) and con- 
volutional layers for multiple channels in Section 7.1.4, until now, we simplified all of our 
numerical examples by working with just a single input and a single output channel. This 
allowed us to think of our inputs, convolution kernels, and outputs each as two-dimensional 
tensors. 


When we add channels into the mix, our inputs and hidden representations both become 
three-dimensional tensors. For example, each RGB input image has shape 3 x h x w. We 
refer to this axis, with a size of 3, as the channel dimension. The notion of channels is 
as old as CNNs themselves: for instance LeNet-5 (LeCun et al., 1995) uses them. In this 
section, we will take a deeper look at convolution kernels with multiple input and multiple 
output channels. 


import torch 
from d2l import torch as d2l


7.4.1 Multiple Input Channels 


When the input data contains multiple channels, we need to construct a convolution kernel
with the same number of input channels as the input data, so that it can perform
cross-correlation with the input data. Assuming that the number of channels for the input data
is $c_i$, the number of input channels of the convolution kernel also needs to be $c_i$. If our
convolution kernel’s window shape is $k_h \times k_w$, then, when $c_i = 1$, we can think of
our convolution kernel as just a two-dimensional tensor of shape $k_h \times k_w$.


However, when $c_i > 1$, we need a kernel that contains a tensor of shape $k_h \times k_w$ for
every input channel. Concatenating these $c_i$ tensors together yields a convolution kernel of
shape $c_i \times k_h \times k_w$. Since the input and convolution kernel each have $c_i$
channels, we can perform a cross-correlation operation on the two-dimensional tensor of the
input and the two-dimensional tensor of the convolution kernel for each channel, adding the
$c_i$ results together (summing over the channels) to yield a two-dimensional tensor. This is
the result of a two-dimensional cross-correlation between a multi-channel input and a
multi-input-channel convolution kernel.


Fig. 7.4.1 provides an example of a two-dimensional cross-correlation with two input
channels. The shaded portions are the first output element as well as the input and kernel
tensor elements used for the output computation:
$(1 \times 1 + 2 \times 2 + 4 \times 3 + 5 \times 4) + (0 \times 0 + 1 \times 1 + 3 \times 2 + 4 \times 3) = 56$.


Cross-correlation computation with two input channels.


To make sure we really understand what is going on here, we can implement cross-correlation 
operations with multiple input channels ourselves. Notice that all we are doing is perform- 
ing a cross-correlation operation per channel and then adding up the results. 


def corr2d_multi_in(X, K):
    # Iterate through the 0th dimension (channel) of K first, then add them up
    return sum(d2l.corr2d(x, k) for x, k in zip(X, K))


We can construct the input tensor X and the kernel tensor K corresponding to the values in 
Fig. 7.4.1 to validate the output of the cross-correlation operation. 


X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
                  [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])


corr2d_multi_in(X, K) 


tensor([[ 56., 72.], 
[104., 120.]]) 


7.4.2 Multiple Output Channels 


Regardless of the number of input channels, so far we always ended up with one output 
channel. However, as we discussed in Section 7.1.4, it turns out to be essential to have 
multiple channels at each layer. In the most popular neural network architectures, we actu- 
ally increase the channel dimension as we go deeper in the neural network, typically down- 
sampling to trade off spatial resolution for greater channel depth. Intuitively, you could 
think of each channel as responding to a different set of features. The reality is a bit more
complicated than this. A naive interpretation would suggest that representations are learned 
independently per pixel or per channel. Instead, channels are optimized to be jointly useful. 
This means that rather than mapping a single channel to an edge detector, it may simply 
mean that some direction in channel space corresponds to detecting edges. 


Denote by $c_i$ and $c_o$ the number of input and output channels, respectively, and by $k_h$
and $k_w$ the height and width of the kernel. To get an output with multiple channels, we can
create a kernel tensor of shape $c_i \times k_h \times k_w$ for every output channel. We
concatenate them on the output channel dimension, so that the shape of the convolution kernel
is $c_o \times c_i \times k_h \times k_w$. In cross-correlation operations, the result on each
output channel is calculated from the convolution kernel corresponding to that output channel
and takes input from all channels in the input tensor.
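The same kernel layout can be observed in the framework's built-in layer. The following
sketch assumes arbitrary example sizes (3 input channels, 8 output channels, a 5 x 3 window)
and simply inspects the shapes of the parameters of nn.Conv2d.

```python
import torch
from torch import nn

c_i, c_o, k_h, k_w = 3, 8, 5, 3
conv = nn.Conv2d(c_i, c_o, kernel_size=(k_h, k_w))

print(conv.weight.shape)   # one c_i x k_h x k_w kernel per output channel
print(conv.bias.shape)     # one bias per output channel

# A single example with c_i channels and a 10 x 10 spatial size
Y = conv(torch.rand(1, c_i, 10, 10))
print(Y.shape)             # c_o output channels, spatial size reduced by the window
```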


We implement a cross-correlation function to calculate the output of multiple channels as 
shown below. 


def corr2d_multi_in_out(X, K):
    # Iterate through the 0th dimension of K, and each time, perform
    # cross-correlation operations with input X. All of the results are
    # stacked together
    return torch.stack([corr2d_multi_in(X, k) for k in K], 0)


We construct a trivial convolution kernel with three output channels by concatenating the 
kernel tensor for K with K+1 and K+2. 


K = torch.stack((K, K + 1, K + 2), 0)
K.shape


torch.Size([3, 2, 2, 2]) 


Below, we perform cross-correlation operations on the input tensor X with the kernel tensor 
K. Now the output contains three channels. The result of the first channel is consistent with 
the result of the previous input tensor X and the multi-input channel, single-output channel 
kernel. 


corr2d_multi_in_out(X, K) 


tensor([[[ 56.,  72.],
         [104., 120.]],

        [[ 76., 100.],
         [148., 172.]],

        [[ 96., 128.],
         [192., 224.]]])




7.4.3 1 x 1 Convolutional Layer 


At first, a 1 x 1 convolution, i.e., $k_h = k_w = 1$, does not seem to make much sense.
After all, a convolution correlates adjacent pixels. A 1 x 1 convolution obviously does 
not. Nonetheless, they are popular operations that are sometimes included in the designs 
of complex deep networks (Lin et al., 2013, Szegedy et al., 2017). Let’s see in some detail 
what it actually does. 


Because the minimum window is used, the 1 x 1 convolution loses the ability of larger con- 
volutional layers to recognize patterns consisting of interactions among adjacent elements 
in the height and width dimensions. The only computation of the 1 x 1 convolution occurs 
on the channel dimension. 


Fig. 7.4.2 shows the cross-correlation computation using the 1 x 1 convolution kernel with 3
input channels and 2 output channels. Note that the inputs and outputs have the same height
and width. Each element in the output is derived from a linear combination of elements at
the same position in the input image. You could think of the 1 x 1 convolutional layer as
constituting a fully connected layer applied at every single pixel location to transform the
$c_i$ corresponding input values into $c_o$ output values. Because this is still a convolutional
layer, the weights are tied across pixel location. Thus the 1 x 1 convolutional layer requires
$c_o \times c_i$ weights (plus the bias). Also note that convolutional layers are typically
followed by nonlinearities. This ensures that 1 x 1 convolutions cannot simply be folded into
other convolutions.
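To make the remark about folding concrete, the sketch below (an illustration with made-up
tensor names, not part of the book's code) stacks a 3 x 3 convolution and a 1 x 1 convolution
without biases. Without a nonlinearity in between, the pair collapses into a single 3 x 3
kernel obtained by mixing the first kernel's output channels with the 1 x 1 weights; with a
ReLU in between this no longer holds.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(1, 3, 6, 6)
W1 = torch.randn(4, 3, 3, 3)   # 3 x 3 convolution mapping 3 -> 4 channels
W2 = torch.randn(2, 4, 1, 1)   # 1 x 1 convolution mapping 4 -> 2 channels

# Fold the 1 x 1 weights into the 3 x 3 kernel: sum over the 4 intermediate channels
W_folded = torch.einsum('om,mikl->oikl', W2[:, :, 0, 0], W1)

Y_two = F.conv2d(F.conv2d(X, W1), W2)               # two linear layers in sequence
Y_one = F.conv2d(X, W_folded)                       # one equivalent layer
Y_relu = F.conv2d(torch.relu(F.conv2d(X, W1)), W2)  # nonlinearity in between

print(torch.allclose(Y_two, Y_one, atol=1e-3))   # the linear stack can be folded
print(torch.allclose(Y_relu, Y_one, atol=1e-3))  # the nonlinear stack cannot
```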


The cross-correlation computation uses the 1 x 1 convolution kernel with three input
channels and two output channels. The input and output have the same height and width.


Let’s check whether this works in practice: we implement a 1 x 1 convolution using a fully 
connected layer. The only thing is that we need to make some adjustments to the data shape 
before and after the matrix multiplication. 


def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = X.reshape((c_i, h * w))
    K = K.reshape((c_o, c_i))
    # Matrix multiplication in the fully connected layer
    Y = torch.matmul(K, X)
    return Y.reshape((c_o, h, w))


When performing 1 x 1 convolutions, the above function is equivalent to the previously
implemented cross-correlation function corr2d_multi_in_out. Let’s check this with some
sample data.


X = torch.normal(0, 1, (3, 3, 3))
K = torch.normal(0, 1, (2, 3, 1, 1))
Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
assert float(torch.abs(Y1 - Y2).sum()) < 1e-6


7.4.4 Discussion 


Channels allow us to combine the best of both worlds: MLPs that allow for significant 
nonlinearities and convolutions that allow for localized analysis of features. In particular, 
channels allow the CNN to reason with multiple features, such as edge and shape detec- 
tors at the same time. They also offer a practical trade-off between the drastic parameter 
reduction arising from translation invariance and locality, and the need for expressive and 
diverse models in computer vision. 


Note, though, that this flexibility comes at a price. Given an image of size $(h \times w)$,
the cost for computing a $k \times k$ convolution is $\mathcal{O}(h \cdot w \cdot k^2)$. For
$c_i$ and $c_o$ input and output channels respectively this increases to
$\mathcal{O}(h \cdot w \cdot k^2 \cdot c_i \cdot c_o)$. For a $256 \times 256$ pixel image with
a $5 \times 5$ kernel and 128 input and output channels respectively this amounts to over 53
billion operations (we count multiplications and additions separately). Later on we will
encounter effective strategies to cut down on the cost, e.g., by requiring the channel-wise
operations to be block-diagonal, leading to architectures such as ResNeXt (Xie et al., 2017).
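The figure of 53 billion can be reproduced with a few lines of arithmetic. The sketch below
assumes that the output keeps the $h \times w$ size of the input (as with suitable padding and
stride 1) and counts one multiplication and one addition per kernel weight, input channel,
output channel, and output position; the function name conv_ops is made up for this example.

```python
def conv_ops(h, w, k, c_i, c_o):
    """Approximate operation count of a k x k convolution on an h x w output."""
    multiply_adds = h * w * k * k * c_i * c_o
    return 2 * multiply_adds   # multiplications and additions counted separately

ops = conv_ops(h=256, w=256, k=5, c_i=128, c_o=128)
print(f"{ops:,}")              # a little under 54 billion operations
```

Doubling both channel counts multiplies the result by four, which is why wide layers at high
resolution dominate the cost of a network.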


7.4.5 Exercises 


1. Assume that we have two convolution kernels of size $k_1$ and $k_2$, respectively (with no
   nonlinearity in between).
   1. Prove that the result of the operation can be expressed by a single convolution.
   2. What is the dimensionality of the equivalent single convolution?
   3. Is the converse true, i.e., can you always decompose a convolution into two smaller
      ones?
2. Assume an input of shape $c_i \times h \times w$ and a convolution kernel of shape
   $c_o \times c_i \times k_h \times k_w$, padding of $(p_h, p_w)$, and stride of $(s_h, s_w)$.
   1. What is the computational cost (multiplications and additions) for the forward
      propagation?
   2. What is the memory footprint?
   3. What is the memory footprint for the backward computation?
   4. What is the computational cost for the backpropagation?
3. By what factor does the number of calculations increase if we double both the number
   of input channels $c_i$ and the number of output channels $c_o$? What happens if we double
   the padding?
4. Are the variables Y1 and Y2 in the final example of this section exactly the same? Why?
5. Express convolutions as a matrix multiplication, even when the convolution window is
   not 1 x 1.
6. Your task is to implement fast convolutions with a $k \times k$ kernel. One of the algorithm
   candidates is to scan horizontally across the source, reading a $k$-wide strip and
   computing the 1-wide output strip one value at a time. The alternative is to read a
   $k + \Delta$ wide strip and compute a $\Delta$-wide output strip. Why is the latter
   preferable? Is there a limit to how large you should choose $\Delta$?
7. Assume that we have a $c \times c$ matrix.
   1. How much faster is it to multiply with a block-diagonal matrix if the matrix is broken
      up into $b$ blocks?
   2. What is the downside of having $b$ blocks? How could you fix it, at least partly?


Discussions.


7.5 Pooling 


In many cases our ultimate task asks some global question about the image, e.g., does it 
contain a cat? Consequently, the units of our final layer should be sensitive to the entire 
input. By gradually aggregating information, yielding coarser and coarser maps, we ac- 
complish this goal of ultimately learning a global representation, while keeping all of the 
advantages of convolutional layers at the intermediate layers of processing. The deeper 
we go in the network, the larger the receptive field (relative to the input) to which each 
hidden node is sensitive. Reducing spatial resolution accelerates this process, since the 
convolution kernels cover a larger effective area. 


Moreover, when detecting lower-level features, such as edges (as discussed in Section 7.2), 
we often want our representations to be somewhat invariant to translation. For instance, 
if we take the image X with a sharp delineation between black and white and shift the 
whole image by one pixel to the right, i.e., Z[i, j] = X[i, j - 1], then the output
for the new image Z might be vastly different. The edge will have shifted by one pixel. In 
reality, objects hardly ever occur exactly at the same place. In fact, even with a tripod and 
a stationary object, vibration of the camera due to the movement of the shutter might shift 
everything by a pixel or so (high-end cameras are loaded with special features to address 
this problem). 


This section introduces pooling layers, which serve the dual purposes of mitigating the
sensitivity of convolutional layers to location and of spatially downsampling
representations.


import torch
from torch import nn
from d2l import torch as d2l


7.5.1 Maximum Pooling and Average Pooling 


Like convolutional layers, pooling operators consist of a fixed-shape window that is slid 
over all regions in the input according to its stride, computing a single output for each lo- 
cation traversed by the fixed-shape window (sometimes known as the pooling window). 
However, unlike the cross-correlation computation of the inputs and kernels in the con- 
volutional layer, the pooling layer contains no parameters (there is no kernel). Instead, 
pooling operators are deterministic, typically calculating either the maximum or the aver- 
age value of the elements in the pooling window. These operations are called maximum 
pooling (max-pooling for short) and average pooling, respectively. 


Average pooling is essentially as old as CNNs. The idea is akin to downsampling an image. 
Rather than just taking the value of every second (or third) pixel for the lower resolution 
image, we can average over adjacent pixels to obtain an image with better signal-to-noise 
ratio since we are combining the information from multiple adjacent pixels. Max-pooling 
was introduced in Riesenhuber and Poggio (1999) in the context of cognitive neuroscience 
to describe how information might be aggregated hierarchically for the purpose
of object recognition; there already was an earlier version in speech recognition (Yamaguchi
et al., 1990). In almost all cases, max-pooling is preferable to average pooling.


In both cases, as with the cross-correlation operator, we can think of the pooling window 
as starting from the upper-left of the input tensor and sliding across it from left to right and 
top to bottom. At each location that the pooling window hits, it computes the maximum or 
average value of the input subtensor in the window, depending on whether max or average 
pooling is employed. 


Max-pooling with a pooling window shape of 2 x 2. The shaded portions are the first
output element as well as the input tensor elements used for the output computation:
$\max(0, 1, 3, 4) = 4$.


The output tensor in Fig. 7.5.1 has a height of 2 and a width of 2. The four elements are
derived from the maximum value in each pooling window:


$$\begin{aligned}
\max(0, 1, 3, 4) &= 4, \\
\max(1, 2, 4, 5) &= 5, \\
\max(3, 4, 6, 7) &= 7, \\
\max(4, 5, 7, 8) &= 8.
\end{aligned} \tag{7.5.1}$$


More generally, we can define a $p \times q$ pooling layer by aggregating over a region of said
size. Returning to the problem of edge detection, we use the output of the convolutional
layer as input for 2 x 2 max-pooling. Denote by X the input of the convolutional layer
and by Y the pooling layer output. Regardless of whether or not the values of X[i, j],
X[i, j + 1], X[i+1, j] and X[i+1, j + 1] are different, the pooling layer always outputs
Y[i, j] = 1. That is to say, using the 2 x 2 max-pooling layer, we can still detect if the
pattern recognized by the convolutional layer moves no more than one element in height or
width.
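This tolerance to small shifts can be seen on a toy image. The sketch below is illustrative
only: the 6 x 8 image, the 1 x 2 edge kernel, and the helper detect_edges are assumptions made
for this example, and the pooling uses a stride of 1 so that output positions can be compared
directly.

```python
import torch
import torch.nn.functional as F

def detect_edges(img):
    """Cross-correlate a 2D image with the horizontal difference kernel [1, -1]."""
    K = torch.tensor([[[[1.0, -1.0]]]])
    return F.conv2d(img.reshape(1, 1, *img.shape), K)

X = torch.ones(6, 8)
X[:, 2:6] = 0                         # white-to-black edge between columns 1 and 2
Z = torch.roll(X, shifts=1, dims=1)   # the same image shifted one pixel to the right

Y_x, Y_z = detect_edges(X), detect_edges(Z)
P_x = F.max_pool2d(Y_x, kernel_size=2, stride=1)
P_z = F.max_pool2d(Y_z, kernel_size=2, stride=1)

# Without pooling, column 1 fires for X but not for the shifted image Z
print(Y_x[0, 0, 0, :], Y_z[0, 0, 0, :])
# After 2 x 2 max-pooling, column 1 fires for both images
print(P_x[0, 0, 0, :], P_z[0, 0, 0, :])
```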


In the code below, we implement the forward propagation of the pooling layer in the pool2d 
function. This function is similar to the corr2d function in Section 7.2. However, no kernel 
is needed, computing the output as either the maximum or the average of each region in the 
input. 


def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = torch.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y


We can construct the input tensor X in Fig. 7.5.1 to validate the output of the two-dimensional 
max-pooling layer. 


X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]) 
pool2d(X, (2, 2)) 


tensor([[4., 5.], 
[7., 8.]]) 


Also, we can experiment with the average pooling layer. 


pool2d(X, (2, 2), 'avg')


tensor([[2., 3.],
        [5., 6.]])


7.5.2 Padding and Stride 


As with convolutional layers, pooling layers change the output shape. And as before, we can 
adjust the operation to achieve a desired output shape by padding the input and adjusting the 
stride. We can demonstrate the use of padding and strides in pooling layers via the built-in 
two-dimensional max-pooling layer from the deep learning framework. We first construct 
an input tensor X whose shape has four dimensions, where the number of examples (batch 
size) and number of channels are both 1. 


X = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4)) 
X 


tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]]]])


Since pooling aggregates information from an area, deep learning frameworks default to 
matching pooling window sizes and stride. For instance, if we use a pooling window of 
shape (3, 3) we get a stride shape of (3, 3) by default. 


pool2d = nn.MaxPool2d(3) 
# Pooling has no model parameters, hence it needs no initialization 
pool2d(X) 


tensor([[[[10.]]]])


Needless to say, the stride and padding can be manually specified to override framework 
defaults if required. 


pool2d = nn.MaxPool2d(3, padding=1, stride=2) 
pool2d(X) 


tensor([[[[ 5.,  7.],
          [13., 15.]]]])
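
The output sizes seen here follow the same arithmetic as for convolutions. As a minimal sketch, assuming the floor-rounding rule that `nn.MaxPool2d` applies by default (no dilation, `ceil_mode=False`), the size along one dimension can be computed as follows. The helper name `pool_output_size` is hypothetical; for rectangular windows it is applied to height and width separately.

```python
def pool_output_size(n, window, stride=None, padding=0):
    """Output length along one dimension of a pooling layer.

    n: input length, window: pooling window length.
    If stride is None it defaults to the window length, mirroring the
    framework default described in the text.
    """
    if stride is None:
        stride = window
    return (n + 2 * padding - window) // stride + 1

# 4 x 4 input, 3 x 3 window, default stride of 3
print(pool_output_size(4, 3))                       # 1
# 4 x 4 input, 3 x 3 window, padding 1, stride 2
print(pool_output_size(4, 3, stride=2, padding=1))  # 2
```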


Of course, we can specify an arbitrary rectangular pooling window with arbitrary height 
and width respectively, as the example below shows. 


pool2d = nn.MaxPool2d((2, 3), stride=(2, 3), padding=(0, 1)) 
pool2d(X) 


tensor([[[[ 5.,  7.],
          [13., 15.]]]])


7.5.3 Multiple Channels 


When processing multi-channel input data, the pooling layer pools each input channel sep- 
arately, rather than summing the inputs up over channels as in a convolutional layer. This 
means that the number of output channels for the pooling layer is the same as the number of 
input channels. Below, we will concatenate tensors X and X + 1 on the channel dimension 
to construct an input with two channels. 


X = torch.cat((X, X + 1), 1)
X


tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]],

         [[ 1.,  2.,  3.,  4.],
          [ 5.,  6.,  7.,  8.],
          [ 9., 10., 11., 12.],
          [13., 14., 15., 16.]]]])


As we can see, the number of output channels is still two after pooling. 


pool2d = nn.MaxPool2d(3, padding=1, stride=2) 
pool2d(X) 


tensor([[[[ 5.,  7.],
          [13., 15.]],

         [[ 6.,  8.],
          [14., 16.]]]])
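
That channels are pooled independently can also be verified directly. The sketch below is an illustration under the same setup as in this section (a 4 × 4 input stacked with a copy of itself plus one): it pools each channel on its own, concatenates the results, and compares them with pooling the two-channel tensor in one call. The variable names are ours.

```python
import torch
from torch import nn

X = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4))
X2 = torch.cat((X, X + 1), 1)  # two channels

pool = nn.MaxPool2d(3, padding=1, stride=2)

# Pool every channel separately, then stack along the channel dimension
per_channel = torch.cat(
    [pool(X2[:, c:c + 1]) for c in range(X2.shape[1])], dim=1)

# Same result as pooling all channels in a single call
print(torch.equal(pool(X2), per_channel))
```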


7.5.4 Summary 


Pooling is an exceedingly simple operation. It does exactly what its name indicates: it 
aggregates results over a window of values. All convolution semantics, such as strides and 
padding, apply in the same way as they did previously. Note that pooling is indifferent to 
channels, i.e., it leaves the number of channels unchanged and it applies to each channel 
separately. Lastly, of the two popular pooling choices, max-pooling is preferable to average 
pooling, as it confers some degree of invariance to output. A popular choice is to pick a 
pooling window size of 2 × 2 to quarter the spatial resolution of output. 


Note that there are many more ways of reducing resolution beyond pooling. For instance, in 
stochastic pooling (Zeiler and Fergus, 2013) and fractional max-pooling (Graham, 2014) 
aggregation is combined with randomization. This can slightly improve the accuracy in 
some cases. Lastly, as we will see later with the attention mechanism, there are more 
refined ways of aggregating over outputs, e.g., by using the alignment between a query and 
representation vectors. 


7.5.5 Exercises 


1. Implement average pooling through a convolution. 
2. Prove that max-pooling cannot be implemented through a convolution alone. 
3. Max-pooling can be accomplished using ReLU operations, i.e., ReLU(x) = max(0, x). 
   1. Express max(a, b) by using only ReLU operations. 
   2. Use this to implement max-pooling by means of convolutions and ReLU layers. 
   3. How many channels and layers do you need for a 2 × 2 convolution? How many for 
      a 3 × 3 convolution? 


4. What is the computational cost of the pooling layer? Assume that the input to the pooling 
layer is of size c × h × w, the pooling window has a shape of p_h × p_w with a padding of 
(p_h, p_w) and a stride of (s_h, s_w). 


5. Why do you expect max-pooling and average pooling to work differently? 


6. Do we need a separate minimum pooling layer? Can you replace it with another opera- 
tion? 


7. We could use the softmax operation for pooling. Why might it not be so popular? 




7.6 Convolutional Neural Networks (LeNet) 


We now have all the ingredients required to assemble a fully-functional CNN. In our earlier 
encounter with image data, we applied a linear model with softmax regression (Section 4.4) 
and an MLP (Section 5.2) to pictures of clothing in the Fashion-MNIST dataset. To make 
such data amenable we first flattened each image from a 28 x 28 matrix into a fixed-length 
784-dimensional vector, and thereafter processed them in fully connected layers. Now that 
we have a handle on convolutional layers, we can retain the spatial structure in our images. 
As an additional benefit of replacing fully connected layers with convolutional layers, we 
will enjoy more parsimonious models that require far fewer parameters. 
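
To get a feel for the size of these savings, the following sketch compares parameter counts. It is a back-of-the-envelope illustration, not part of the original text: it assumes a single-channel 28 × 28 input mapped to 6 feature maps of the same spatial size, once with a 5 × 5 convolution and once with a fully connected layer. The helper names are hypothetical.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    # One kernel_size x kernel_size filter per (input, output) channel pair,
    # plus one bias per output channel
    return out_channels * in_channels * kernel_size ** 2 + out_channels

def linear_params(in_features, out_features):
    # Dense weight matrix plus one bias per output
    return in_features * out_features + out_features

conv = conv2d_params(in_channels=1, out_channels=6, kernel_size=5)
dense = linear_params(in_features=28 * 28, out_features=6 * 28 * 28)

print(conv)   # 156
print(dense)  # 3692640
```

Under these assumptions the convolutional layer needs 156 parameters where the fully connected layer would need several million, because the same small kernel is reused at every spatial location.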


In this section, we will introduce LeNet, among the first published CNNs to capture wide 
attention for its performance on computer vision tasks. The model was introduced by (and 
named for) Yann LeCun, then a researcher at AT&T Bell Labs, for the purpose of 
recognizing handwritten digits in images (LeCun et al., 1998). This work represented the 
culmination of a decade of research developing the technology; LeCun’s team published 
the first study to successfully train CNNs via backpropagation (LeCun et al., 1989). 


At the time, LeNet achieved outstanding results matching the performance of support vector 
machines, then a dominant approach in supervised learning, achieving an error rate of less 
than 1% per digit. LeNet was eventually adapted to recognize digits for processing deposits 
in ATMs. To this day, some ATMs still run the code that Yann LeCun and his 
colleague Leon Bottou wrote in the 1990s! 


import torch 
from torch import nn 
from d2l import torch as d2l 


7.6.1 LeNet 


At a high level, LeNet (LeNet-5) consists of two parts: (i) a convolutional encoder consist- 
ing of two convolutional layers; and (ii) a dense block consisting of three fully connected 
layers. The architecture is summarized in Fig. 7.6.1. 


Fig. 7.6.1: Data flow in LeNet. The input is a handwritten digit, the output is a probability 
over 10 possible outcomes. 


The basic units in each convolutional block are a convolutional layer, a sigmoid activation 
function, and a subsequent average pooling operation. Note that while ReLUs and 
max-pooling work better, they had not yet been discovered. Each convolutional layer uses a 5 × 5 
kernel and a sigmoid activation function. These layers map spatially arranged inputs to a 
number of two-dimensional feature maps, typically increasing the number of channels. The 
first convolutional layer has 6 output channels, while the second has 16. Each 2 × 2 pooling 
operation (stride 2) reduces dimensionality by a factor of 4 via spatial downsampling. The 
convolutional block emits an output with shape given by (batch size, number of channels, 
height, width). 


In order to pass output from the convolutional block to the dense block, we must flatten each 
example in the minibatch. In other words, we take this four-dimensional input and transform 
it into the two-dimensional input expected by fully connected layers: as a reminder, the two- 
dimensional representation that we desire uses the first dimension to index examples in the 
minibatch and the second to give the flat vector representation of each example. LeNet’s 
dense block has three fully connected layers, with 120, 84, and 10 outputs, respectively. 
Because we are still performing classification, the 10-dimensional output layer corresponds 
to the number of possible output classes. 


While getting to the point where you truly understand what is going on inside LeNet may 
have taken a bit of work, we hope that the following code snippet will convince you that 
implementing such models with modern deep learning frameworks is remarkably simple. 
We need only instantiate a Sequential block and chain together the appropriate layers, 
using Xavier initialization as introduced in Section 5.4.2. 


def init_cnn(module):  #@save
    """Initialize weights for CNNs."""
    if type(module) == nn.Linear or type(module) == nn.Conv2d:
        nn.init.xavier_uniform_(module.weight)


class LeNet(d2l.Classifier):  #@save
    """The LeNet-5 model."""
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5, padding=2), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
            nn.LazyLinear(120), nn.Sigmoid(),
            nn.LazyLinear(84), nn.Sigmoid(),
            nn.LazyLinear(num_classes))


We have taken some liberty in the reproduction of LeNet insofar as we have replaced the 
Gaussian activation layer by a softmax layer. This greatly simplifies the implementation, 
not least due to the fact that the Gaussian decoder is rarely used nowadays. Other than that, 
this network matches the original LeNet-5 architecture. 


Let’s see what happens inside the network. By passing a single-channel (black and white) 
28 × 28 image through the network and printing the output shape at each layer, we can inspect 
the model to ensure that its operations line up with what we expect from Fig. 7.6.2. 


Fig. 7.6.2: Compressed notation for LeNet-5. 


@d2l.add_to_class(d2l.Classifier)  #@save
def layer_summary(self, X_shape):
    X = torch.randn(*X_shape)
    for layer in self.net:
        X = layer(X)
        print(layer.__class__.__name__, 'output shape:\t', X.shape)


model = LeNet() 
model.layer_summary((1, 1, 28, 28)) 


Conv2d output shape: torch.Size([1, 6, 28, 28]) 
Sigmoid output shape: torch.Size([1, 6, 28, 28]) 
AvgPool2d output shape: torch.Size([1, 6, 14, 14]) 
Conv2d output shape: torch.Size([1, 16, 10, 10]) 
Sigmoid output shape: torch.Size([1, 16, 10, 10]) 
AvgPool2d output shape: torch.Size([1, 16, 5, 5]) 
Flatten output shape: torch.Size([1, 400]) 
Linear output shape: torch.Size([1, 120]) 
Sigmoid output shape: torch.Size([1, 120]) 
Linear output shape: torch.Size([1, 84]) 
Sigmoid output shape: torch.Size([1, 84]) 
Linear output shape: torch.Size([1, 10]) 


Note that the height and width of the representation at each layer throughout the convolu- 
tional block is reduced (compared with the previous layer). The first convolutional layer 
uses two pixels of padding to compensate for the reduction in height and width that would 
otherwise result from using a 5 x 5 kernel. As an aside, the image size of 28 x 28 pixels in 
the original MNIST OCR dataset is a result of trimming two pixel rows (and columns) from 
the original scans that measured 32 x 32 pixels. This was done primarily to save space (a 
30% reduction) at a time when megabytes mattered. 


In contrast, the second convolutional layer forgoes padding, and thus the height and width 
are both reduced by four pixels. As we go up the stack of layers, the number of channels 
increases layer-over-layer from 1 in the input to 6 after the first convolutional layer and 
16 after the second convolutional layer. However, each pooling layer halves the height and 
width. Finally, each fully connected layer reduces dimensionality, finally emitting an output 
whose dimension matches the number of classes. 
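
The shapes printed by the layer summary can be reproduced with plain arithmetic. The sketch below is a hand calculation under the layer settings used in this section (5 × 5 convolutions, the first with padding 2, and 2 × 2 average pooling with stride 2). It skips the sigmoid layers because they do not change shapes, and the helper names are hypothetical.

```python
def output_size(n, kernel, stride=1, padding=0):
    # Standard size formula shared by convolution and pooling layers
    return (n + 2 * padding - kernel) // stride + 1

def lenet_shapes(size=28, channels=1):
    """Return (layer name, shape without batch dimension) for LeNet."""
    # (name, output channels or None to keep them, kernel, stride, padding)
    conv_block = [
        ("Conv2d", 6, 5, 1, 2),
        ("AvgPool2d", None, 2, 2, 0),
        ("Conv2d", 16, 5, 1, 0),
        ("AvgPool2d", None, 2, 2, 0),
    ]
    shapes = []
    for name, out_channels, kernel, stride, padding in conv_block:
        if out_channels is not None:
            channels = out_channels
        size = output_size(size, kernel, stride, padding)
        shapes.append((name, (channels, size, size)))
    # Flatten merges channels, height, and width into one vector
    shapes.append(("Flatten", (channels * size * size,)))
    for width in (120, 84, 10):
        shapes.append(("Linear", (width,)))
    return shapes

for name, shape in lenet_shapes():
    print(name, shape)
```

The flattened vector has 16 × 5 × 5 = 400 entries, which is the input size that the first fully connected layer infers lazily.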


7.6.2 Training 


Now that we have implemented the model, let’s run an experiment to see how the LeNet-5 
model fares on Fashion-MNIST. 


While CNNs have fewer parameters, they can still be more expensive to compute than 
similarly deep MLPs because each parameter participates in many more multiplications. 
If you have access to a GPU, this might be a good time to put it into action to speed up 
training. Note that the d2l.Trainer class takes care of all details. By default, it initializes 
the model parameters on the available devices. Just as with MLPs, our loss function is 
cross-entropy, and we minimize it via minibatch stochastic gradient descent. 


trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128)
model = LeNet(lr=0.1)
model.apply_init([next(iter(data.get_dataloader(True)))[0]], init_cnn)
trainer.fit(model, data)


[Figure: training curves of train_loss, val_loss, and val_acc over epochs 0 to 10.] 


7.6.3 Summary 


We have made significant progress in this chapter. We moved from the MLPs of the 1980s 
to the CNNs of the 1990s and early 2000s. The architectures proposed, e.g., in the form 
of LeNet-5 remain meaningful, even to this day. It is worth comparing the error rates on 
Fashion-MNIST achievable with LeNet-5 both to the very best possible with MLPs (Section 
5.2) and those with significantly more advanced architectures such as ResNet (Section 8.6). 
LeNet is much more similar to the latter than to the former. One of the primary differences, 
as we shall see, is that greater amounts of computation enabled significantly more complex 
architectures. 


A second difference is the relative ease with which we were able to implement LeNet. What 
used to be an engineering challenge worth months of C++ and assembly code, engineering 
to improve SN, an early Lisp-based deep learning tool (Bottou and Le Cun, 1988), and fi- 
nally experimentation with models can now be accomplished in minutes. It is this incredible 
productivity boost that has democratized deep learning model development tremendously. 
In the next chapter we will journey down this rabbit hole to see where it takes us. 


7.6.4 Exercises 
1. Let’s modernize LeNet. Implement and test the following changes: 
   1. Replace average pooling with max-pooling. 
   2. Replace the sigmoid activations with ReLU. 


2. Try to change the size of the LeNet style network to improve its accuracy in addition to 
max-pooling and ReLU. 
   1. Adjust the convolution window size. 
   2. Adjust the number of output channels. 
   3. Adjust the number of convolution layers. 
   4. Adjust the number of fully connected layers. 
   5. Adjust the learning rates and other training details (e.g., initialization and number of 
      epochs). 


3. Try out the improved network on the original MNIST dataset. 


4. Display the activations of the first and second layer of LeNet for different inputs (e.g., 
sweaters and coats). 


5. What happens to the activations when you feed significantly different images into the 
network (e.g., cats, cars, or even random noise)? 


Modern Convolutional Neural Networks 


Now that we understand the basics of wiring together CNNs, let’s take a tour of modern 
CNN architectures. This tour is, by necessity, incomplete, thanks to the plethora of excit- 
ing new designs being added. Their importance derives from the fact that not only can they 
be used directly for vision tasks, but they also serve as basic feature generators for more 
advanced tasks such as tracking (Zhang et al., 2021), segmentation (Long et al., 2015), ob- 
ject detection (Redmon and Farhadi, 2018), or style transformation (Gatys et al., 2016). In 
this chapter, most sections correspond to a significant CNN architecture that was at some 
point (or currently) the base model upon which many research projects and deployed sys- 
tems were built. Each of these networks was briefly a dominant architecture and many were 
winners or runners-up in the ImageNet competition, which has served as a barometer 
of progress on supervised learning in computer vision since 2010. It is only recently that 
Transformers have begun to displace CNNs, starting with Dosovitskiy et al. (2021) and 
followed by the Swin Transformer (Liu et al., 2021). We will cover this development later 
in Chapter 11. 


While the idea of deep neural networks is quite simple (stack together a bunch of layers), 
performance can vary wildly across architectures and hyperparameter choices. The neural 
networks described in this chapter are the product of intuition, a few mathematical insights, 
and a lot of trial and error. We present these models in chronological order, partly to convey 
a sense of the history so that you can form your own intuitions about where the field is 
heading and perhaps develop your own architectures. For instance, batch normalization and 
residual connections described in this chapter have offered two popular ideas for training 
and designing deep models, both of which have since also been applied to architectures 
beyond computer vision. 


We begin our tour of modern CNNs with AlexNet (Krizhevsky et al., 2012), the first large- 
scale network deployed to beat conventional computer vision methods on a large-scale vi- 
sion challenge; the VGG network (Simonyan and Zisserman, 2014), which makes use of a 
number of repeating blocks of elements; the network in network (NiN) that convolves whole 
neural networks patch-wise over inputs (Lin et al., 2013); GoogLeNet that uses networks 
with multi-branch convolutions (Szegedy et al., 2015); the residual network (ResNet) (He 
et al., 2016), which remains one of the most popular off-the-shelf architectures in computer 
vision; ResNeXt blocks (Xie et al., 2017) for sparser connections; and DenseNet (Huang 
et al., 2017) for a generalization of the residual architecture. Over time many special opti- 
mizations for efficient networks have been developed, such as coordinate shifts (ShiftNet) 
(Wu et al., 2018). This culminated in the automatic search for efficient architectures such 
as MobileNet v3 (Howard et al., 2019). It also includes the semi-automatic design 
exploration of Radosavovic et al. (2020) that led to the RegNetX/Y which we will discuss later 
in this chapter. The work is instructive insofar as it offers a path for marrying brute force 
computation with the ingenuity of an experimenter in the search for efficient design spaces. 
Of note is also the work of Liu et al. (2022) as it shows that training techniques (e.g., op- 
timizers, data augmentation, and regularization) play a pivotal role in improving accuracy. 
It also shows that long-held assumptions, such as the size of a convolution window, may 
need to be revisited, given the increase in computation and data. We will cover this and 
many more questions in due course throughout this chapter. 


8.1 Deep Convolutional Neural Networks (AlexNet) 


Although CNNs were well known in the computer vision and machine learning commu- 
nities following the introduction of LeNet (LeCun et al., 1995), they did not immediately 
dominate the field. Although LeNet achieved good results on early small datasets, the per- 
formance and feasibility of training CNNs on larger, more realistic datasets had yet to be 
established. In fact, for much of the intervening time between the early 1990s and the wa- 
tershed results of 2012 (Krizhevsky et al., 2012), neural networks were often surpassed 
by other machine learning methods, such as kernel methods (Schölkopf and Smola, 2002), 
ensemble methods (Freund and Schapire, 1996), and structured estimation (Taskar et al., 
2004). 


For computer vision, this comparison is perhaps not entirely accurate. That is, although 
the inputs to convolutional networks consist of raw or lightly-processed (e.g., by center- 
ing) pixel values, practitioners would never feed raw pixels into traditional models. In- 
stead, typical computer vision pipelines consisted of manually engineering feature extrac- 
tion pipelines, such as SIFT (Lowe, 2004), SURF (Bay et al., 2006), and bags of visual 
words (Sivic and Zisserman, 2003). Rather than learning the features, the features were 
crafted. Most of the progress came from having more clever ideas for feature extraction on 
the one hand and deep insight into geometry (Hartley and Zisserman, 2000) on the other. 
The learning algorithm was often considered an afterthought. 


Although some neural network accelerators were available in the 1990s, they were not yet 
sufficiently powerful to make deep multichannel, multilayer CNNs with a large number 
of parameters practical. For instance, NVIDIA’s GeForce 256 from 1999 was able to process at 
most 480 million floating-point operations, such as additions and multiplications, per sec- 
ond (MFLOPS), without any meaningful programming framework for operations beyond 
games. Today’s accelerators are able to perform in excess of 1000 TFLOPs per device. 
Moreover, datasets were still relatively small: OCR on 60,000 low-resolution 28 x 28 pixel 
images was considered a highly challenging task. Added to these obstacles, key tricks for 
training neural networks including parameter initialization heuristics (Glorot and Bengio, 
2010), clever variants of stochastic gradient descent (Kingma and Ba, 2014), non-squashing 
activation functions (Nair and Hinton, 2010), and effective regularization techniques 
(Srivastava et al., 2014) were still missing. 


Thus, rather than training end-to-end (pixel to classification) systems, classical pipelines 
looked more like this: 


1. Obtain an interesting dataset. In the early days, these datasets required expensive 
sensors. For instance, the Apple QuickTake 100 of 1994 sported a whopping 0.3 
megapixel (VGA) resolution, capable of storing up to 8 images, all for the price of $1000. 


2. Preprocess the dataset with hand-crafted features based on some knowledge of optics, 
geometry, other analytic tools, and occasionally on the serendipitous discoveries by 
lucky graduate students. 


3. Feed the data through a standard set of feature extractors such as the SIFT (scale- 
invariant feature transform) (Lowe, 2004), the SURF (speeded up robust features) (Bay 
et al., 2006), or any number of other hand-tuned pipelines. OpenCV still provides SIFT 
extractors to this day! 


4. Dump the resulting representations into your favorite classifier, likely a linear model or 
kernel method, to train a classifier. 
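
The steps above can be sketched in a few lines. What follows is a deliberately tiny, hypothetical stand-in for such a pipeline, not a reproduction of SIFT or SURF: it assumes synthetic striped images as the dataset, uses a fixed hand-designed feature (the share of gradient energy along each axis), and fits a least-squares linear classifier. All names (`make_stripes`, `gradient_energy`) are ours. The point is only that nothing in the feature extractor is learned.

```python
import numpy as np

def make_stripes(vertical, rng, size=16, noise=0.05):
    """Synthetic 'dataset': noisy vertical or horizontal stripes."""
    pattern = ((np.arange(size) // 2) % 2).astype(float)
    img = np.tile(pattern, (size, 1))  # values change along columns
    if not vertical:
        img = img.T                    # values change along rows
    return img + noise * rng.standard_normal((size, size))

def gradient_energy(img):
    """Hand-crafted feature: fraction of gradient energy per direction."""
    gy, gx = np.gradient(img)          # derivatives along rows, columns
    energy = np.array([np.abs(gx).sum(), np.abs(gy).sum()])
    return energy / energy.sum()

def build_set(n_per_class, rng):
    feats, labels = [], []
    for vertical, label in ((True, 1.0), (False, -1.0)):
        for _ in range(n_per_class):
            feats.append(gradient_energy(make_stripes(vertical, rng)))
            labels.append(label)
    return np.array(feats), np.array(labels)

rng = np.random.default_rng(0)
F_train, y_train = build_set(20, rng)
F_test, y_test = build_set(10, rng)

# The 'favorite classifier': a linear model fit by least squares
w, *_ = np.linalg.lstsq(F_train, y_train, rcond=None)
accuracy = np.mean(np.sign(F_test @ w) == y_test)
print(accuracy)
```

On this toy problem the fixed feature separates the classes easily. On real images, designing such features was the hard part, which is the situation the text describes.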


If you spoke to machine learning researchers, they would reply that machine learning was 
both important and beautiful. Elegant theories proved the properties of various classifiers 
(Boucheron et al., 2005) and convex optimization (Boyd and Vandenberghe, 2004) had 
become the mainstay for obtaining them. The field of machine learning was thriving, rig- 
orous, and eminently useful. However, if you spoke to a computer vision researcher, you 
would hear a very different story. The dirty truth of image recognition, they would tell 
you, is that features, geometry (Hartley and Zisserman, 2000, Hartley and Kahl, 2009), 
and engineering, rather than novel learning algorithms, drove progress. Computer vision 
researchers justifiably believed that a slightly bigger or cleaner dataset or a slightly im- 
proved feature-extraction pipeline mattered far more to the final accuracy than any learning 
algorithm. 


import torch 
from torch import nn 
from d2l import torch as d2l 


8.1.1 Representation Learning 


Another way to cast the state of affairs is that the most important part of the pipeline was the 
representation. And up until 2012 the representation was calculated mostly mechanically. 
In fact, engineering a new set of feature functions, improving results, and writing up the 
method all featured prominently in papers. SIFT (Lowe, 2004), SURF (Bay et al., 2006), 
HOG (histograms of oriented gradient) (Dalal and Triggs, 2005), bags of visual words 
(Sivic and Zisserman, 2003), and similar feature extractors ruled the roost. 


Another group of researchers, including Yann LeCun, Geoff Hinton, Yoshua Bengio, 
Andrew Ng, Shun-ichi Amari, and Juergen Schmidhuber, had different plans. They believed 
that features themselves ought to be learned. Moreover, they believed that to be reasonably 
complex, the features ought to be hierarchically composed with multiple jointly learned 
layers, each with learnable parameters. In the case of an image, the lowest layers might 
come to detect edges, colors, and textures, by analogy with how the visual system in ani- 
mals processes its input. In particular, the automatic design of visual features such as those 
obtained by sparse coding (Olshausen and Field, 1996) remained an open challenge until 
the advent of modern CNNs. It was not until Dean et al. (2012) and Le (2013) that the idea of 
generating features from image data automatically gained significant traction. 


The first modern CNN (Krizhevsky et al., 2012), named AlexNet after one of its inventors, 
Alex Krizhevsky, is largely an evolutionary improvement over LeNet. It achieved excellent 
performance in the 2012 ImageNet challenge. 


Fig. 8.1.1: Image filters learned by the first layer of AlexNet. Reproduction courtesy of 
Krizhevsky et al. (2012). 


Interestingly, in the lowest layers of the network, the model learned feature extractors that 
resembled some traditional filters. Fig. 8.1.1 shows lower-level image descriptors. Higher 
layers in the network might build upon these representations to represent larger structures, 
like eyes, noses, blades of grass, and so on. Even higher layers might represent whole 
objects like people, airplanes, dogs, or frisbees. Ultimately, the final hidden state learns a 
compact representation of the image that summarizes its contents such that data belonging 
to different categories can be easily separated. 


AlexNet (2012) and its precursor LeNet (1995) share many architectural elements. This 
begs the question: why did it take so long? A key difference was that, over the previous two 
decades, the amount of data and the computing power available had increased significantly. 
As such AlexNet was much larger: it was trained on much more data, and on much faster 
GPUs compared to the CPUs available in 1995. 


Missing Ingredient: Data 


Deep models with many layers require large amounts of data in order to enter the regime 
where they significantly outperform traditional methods based on convex optimizations 
(e.g., linear and kernel methods). However, given the limited storage capacity of computers, 
the relative expense of (imaging) sensors, and the comparatively tighter research budgets 
in the 1990s, most research relied on tiny datasets. Numerous papers relied on the UCI 
collection of datasets, many of which contained only hundreds or (a few) thousands of 
images captured in low resolution and often with an artificially clean background. 


In 2009, the ImageNet dataset was released (Deng et al., 2009), challenging researchers 
to learn models from 1 million examples, 1000 each from 1000 distinct categories of ob- 
jects. The categories themselves were based on the most popular noun nodes in WordNet 
(Miller, 1995). The ImageNet team used Google Image Search to prefilter large candidate 
sets for each category and employed the Amazon Mechanical Turk crowdsourcing pipeline 
to confirm for each image whether it belonged to the associated category. This scale was un- 
precedented, exceeding others by over an order of magnitude (e.g., CIFAR-100 has 60,000 
images). Another aspect was that the images were at relatively high resolution of 224 x 224 
pixels, unlike the 80 million-sized TinyImages dataset (Torralba et al., 2008), consisting 
of 32 x 32 pixel thumbnails. This allowed for the formation of higher-level features. The 
associated competition, dubbed the ImageNet Large Scale Visual Recognition Challenge 
(Russakovsky et al., 2015), pushed computer vision and machine learning research for- 
ward, challenging researchers to identify which models performed best at a greater scale 
than academics had previously considered. The largest vision datasets, such as LAION-5B 
(Schuhmann et al., 2022), contain billions of images with additional metadata. 


Missing Ingredient: Hardware 


Deep learning models are voracious consumers of compute cycles. Training can take hun- 
dreds of epochs, and each iteration requires passing data through many layers of compu- 
tationally expensive linear algebra operations. This is one of the main reasons why in the 
1990s and early 2000s, simple algorithms based on the more-efficiently optimized convex 
objectives were preferred. 


Graphical processing units (GPUs) proved to be a game changer in making deep learn- 
ing feasible. These chips had earlier been developed for accelerating graphics processing 
to benefit computer games. In particular, they were optimized for high throughput 4 x 4 
matrix—vector products, which are needed for many computer graphics tasks. Fortunately, 
the math is strikingly similar to that required for calculating convolutional layers. Around 
that time, NVIDIA and ATI had begun optimizing GPUs for general computing opera- 
tions (Fernando, 2004), going as far as to market them as general-purpose GPUs (GPG- 
PUs). 
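
One common way to see why convolutions map well onto hardware built for matrix products is to rewrite a convolution as a single matrix multiplication over extracted patches (often called im2col). The sketch below illustrates that idea with `torch.nn.functional.unfold`. It is an illustration of the mathematical equivalence, not a claim about how any particular GPU library implements convolutions, and the tensor sizes are arbitrary choices of ours.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(1, 3, 8, 8)  # (batch, channels, height, width)
W = torch.randn(4, 3, 3, 3)  # (out channels, in channels, k_h, k_w)

# Extract every 3 x 3 patch as a column: shape (1, 3 * 3 * 3, 6 * 6)
patches = F.unfold(X, kernel_size=3)

# One matrix product computes all output channels at all locations
out = W.reshape(4, -1) @ patches  # shape (1, 4, 36)
out = out.reshape(1, 4, 6, 6)

# Compare with the built-in cross-correlation
reference = F.conv2d(X, W)
print(torch.allclose(out, reference, atol=1e-4))
```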


To provide some intuition, consider the cores of a modern microprocessor (CPU). Each 
of the cores is fairly powerful running at a high clock frequency and sporting large caches 
(up to several megabytes of L3). Each core is well-suited to executing a wide range of 
instructions, with branch predictors, a deep pipeline, specialized execution units, speculative 
execution, and many other bells and whistles that enable it to run a large variety of 
programs with sophisticated control flow. This apparent strength, however, is also its Achilles 
heel: general-purpose cores are very expensive to build. They excel at general-purpose 
code with lots of control flow. This requires lots of chip area, not just for the actual ALU 
(arithmetic logical unit) where computation happens, but also for all the aforementioned 
bells and whistles, plus memory interfaces, caching logic between cores, high-speed in- 
terconnects, and so on. CPUs are comparatively bad at any single task when compared 
with dedicated hardware. Modern laptops have 4-8 cores, and even high-end servers rarely 
exceed 64 cores per socket, simply because it is not cost-effective. 


By comparison, GPUs can consist of thousands of small processing elements (NVIDIA’s
latest Ampere chips have up to 6912 CUDA cores), often organized into larger groups (NVIDIA
calls them warps). The details differ somewhat between NVIDIA, AMD, ARM and other
chip vendors. While each core is relatively weak, running at about 1GHz clock frequency,
it is the total number of such cores that makes GPUs orders of magnitude faster than
CPUs. For instance, NVIDIA’s recent Ampere A100 GPU offers over 300 TFLOPs per
chip for specialized 16-bit precision (BFLOAT16) matrix-matrix multiplications, and up
to 20 TFLOPs for more general-purpose floating point operations (FP32). At the same
time, floating point performance of CPUs rarely exceeds 1 TFLOPs. For instance, Amazon’s
Graviton 3 reaches 2 TFLOPs peak performance for 16-bit precision operations, a
number similar to the GPU performance of Apple’s M1 processor.


There are many reasons why GPUs are much faster than CPUs in terms of FLOPs. First,
power consumption tends to grow quadratically with clock frequency. Hence, for the power
budget of a CPU core that runs four times faster (a typical number), you can use 16 GPU
cores at 1/4 the speed, which yields 16 x 1/4 = 4 times the performance. Second, GPU cores are
much simpler (in fact, for a long time they were not even able to execute general-purpose 
code), which makes them more energy efficient. For instance, (i) they tend not to support 
speculative evaluation, (ii) it typically is not possible to program each processing element 
individually, and (iii) the caches per core tend to be much smaller. Last, many operations 
in deep learning require high memory bandwidth. Again, GPUs shine here with buses that 
are at least 10 times as wide as many CPUs. 
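

The power argument above can be checked with a few lines of arithmetic. The sketch below assumes the idealized model from the text (power grows with the square of the clock frequency, and the work parallelizes perfectly); the function name relative_power and the normalized clock values are chosen purely for illustration.

```python
def relative_power(clock):
    """Power of one core relative to a core at clock 1.0, assuming power ~ clock^2."""
    return clock ** 2


cpu_clock = 1.0   # one fast core, used as the reference speed
gpu_clock = 0.25  # each slow core runs at a quarter of that speed

# How many slow cores fit into the power budget of one fast core?
num_gpu_cores = round(relative_power(cpu_clock) / relative_power(gpu_clock))

# Idealized throughput: number of cores times clock, assuming perfectly parallel work.
cpu_throughput = 1 * cpu_clock
gpu_throughput = num_gpu_cores * gpu_clock

print(num_gpu_cores, gpu_throughput / cpu_throughput)
```

Under these assumptions the same power budget buys 16 slow cores and four times the throughput. Real chips deviate from the quadratic model, so the numbers should be read as an order-of-magnitude argument only.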


Back to 2012. A major breakthrough came when Alex Krizhevsky and Ilya Sutskever
implemented a deep CNN that could run on GPUs. They realized that the computational
bottlenecks in CNNs, convolutions and matrix multiplications, are all operations that could be
parallelized in hardware. Using two NVIDIA GTX 580s with 3GB of memory, each of
which was capable of 1.5 TFLOPs (still a challenge for most CPUs a decade later), they
implemented fast convolutions. The cuda-convnet code was good enough that for several
years it was the industry standard and powered the first couple of years of the deep learning
boom.


8.1.2 AlexNet 


AlexNet, which employed an 8-layer CNN, won the ImageNet Large Scale Visual Recognition
Challenge 2012 by a large margin (Russakovsky et al., 2013). This network showed,
for the first time, that the features obtained by learning can transcend manually-designed
features, breaking the previous paradigm in computer vision.


The architectures of AlexNet and LeNet are strikingly similar, as Fig. 8.1.2 illustrates. Note 
that we provide a slightly streamlined version of AlexNet removing some of the design 
quirks that were needed in 2012 to make the model fit on two small GPUs. 


Fig. 8.1.2 From LeNet (left) to AlexNet (right).


There are also significant differences between AlexNet and LeNet. First, AlexNet is much 
deeper than the comparatively small LeNet-5. AlexNet consists of eight layers: five con- 
volutional layers, two fully connected hidden layers, and one fully connected output layer. 
Second, AlexNet used the ReLU instead of the sigmoid as its activation function. Let’s 
delve into the details below. 


Architecture 


In AlexNet’s first layer, the convolution window shape is 11 x 11. Since the images in 
ImageNet are eight times taller and wider than the MNIST images, objects in ImageNet 
data tend to occupy more pixels with more visual detail. Consequently, a larger convolution 
window is needed to capture the object. The convolution window shape in the second 
layer is reduced to 5 x 5, followed by 3 x 3. In addition, after the first, second, and fifth 
convolutional layers, the network adds max-pooling layers with a window shape of 3 x 
3 and a stride of 2. Moreover, AlexNet has ten times more convolution channels than 
LeNet. 
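

To see how these window shapes and strides shrink the image, the sketch below applies the standard output-size formula, floor((n + 2p - k) / s) + 1, layer by layer. The helper out_size and the layer names are illustrative; the padding values are taken from the implementation given later in this section, not from the original paper.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output height/width of a convolution or pooling window on an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) for the convolutional and pooling layers,
# following the streamlined AlexNet implemented in this section.
layers = [
    ("conv1", 11, 4, 1), ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2), ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1), ("conv4", 3, 1, 1), ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]

size = 224  # input height and width
sizes = {}
for name, k, s, p in layers:
    size = out_size(size, k, s, p)
    sizes[name] = size
    print(f"{name}: {size} x {size}")
```

The first convolution does most of the downsampling through its stride of 4; the 3 x 3 convolutions with padding 1 leave the resolution untouched, and each pooling layer roughly halves it.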


After the final convolutional layer, there are two huge fully connected layers with 4096
outputs. These layers require nearly 1GB of model parameters. Because of the limited memory
in early GPUs, the original AlexNet used a dual data stream design, so that each of their
two GPUs could be responsible for storing and computing only its half of the model.
Fortunately, GPU memory is comparatively abundant now, so we rarely need to break up models
across GPUs these days (our version of the AlexNet model deviates from the original paper
in this aspect).


Activation Functions 


Furthermore, AlexNet changed the sigmoid activation function to a simpler ReLU activa- 
tion function. On the one hand, the computation of the ReLU activation function is simpler. 
For example, it does not have the exponentiation operation found in the sigmoid activation 
function. On the other hand, the ReLU activation function makes model training easier 
when using different parameter initialization methods. This is because, when the output 
of the sigmoid activation function is very close to 0 or 1, the gradient of these regions is 
almost 0, so that backpropagation cannot continue to update some of the model parameters. 
By contrast, the gradient of the ReLU activation function in the positive interval is always 1 
(Section 5.1.2). Therefore, if the model parameters are not properly initialized, the sigmoid 
function may obtain a gradient of almost 0 in the positive interval, meaning that the model 
cannot be effectively trained. 
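

The saturation effect is easy to observe numerically. The following sketch uses only the standard library and the identity sigmoid'(x) = sigmoid(x) (1 - sigmoid(x)); the helper names are illustrative, and the ReLU derivative at exactly 0 is set to 0 by convention.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    # Derivative of max(0, x); we take the value 0 at x = 0 by convention.
    return 1.0 if x > 0 else 0.0


for x in (0.5, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```

The sigmoid gradient peaks at 0.25 for x = 0 and decays rapidly as the input grows, whereas the ReLU gradient stays at 1 for every positive input.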


Capacity Control and Preprocessing 


AlexNet controls the model complexity of the fully connected layer by dropout (Section 
5.6), while LeNet only uses weight decay. To augment the data even further, the training 
loop of AlexNet added a great deal of image augmentation, such as flipping, clipping, and 
color changes. This makes the model more robust and the larger sample size effectively 
reduces overfitting. See Buslaev et al. (2020) for an in-depth review of such preprocessing 
steps. 


class AlexNet(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(96, kernel_size=11, stride=4, padding=1),
            nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2),
            nn.LazyConv2d(256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.LazyConv2d(384, kernel_size=3, padding=1), nn.ReLU(),
            nn.LazyConv2d(384, kernel_size=3, padding=1), nn.ReLU(),
            nn.LazyConv2d(256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(p=0.5),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(p=0.5),
            nn.LazyLinear(num_classes))
        self.net.apply(d2l.init_cnn)


We construct a single-channel data example with both height and width of 224 to observe
the output shape of each layer. It matches the AlexNet architecture in Fig. 8.1.2.


AlexNet().layer_summary((1, 1, 224, 224))


Conv2d output shape: torch.Size([1, 96, 54, 54])
ReLU output shape: torch.Size([1, 96, 54, 54])
MaxPool2d output shape: torch.Size([1, 96, 26, 26])
Conv2d output shape: torch.Size([1, 256, 26, 26])
ReLU output shape: torch.Size([1, 256, 26, 26])
MaxPool2d output shape: torch.Size([1, 256, 12, 12])
Conv2d output shape: torch.Size([1, 384, 12, 12])
ReLU output shape: torch.Size([1, 384, 12, 12])
Conv2d output shape: torch.Size([1, 384, 12, 12])
ReLU output shape: torch.Size([1, 384, 12, 12])
Conv2d output shape: torch.Size([1, 256, 12, 12])
ReLU output shape: torch.Size([1, 256, 12, 12])
MaxPool2d output shape: torch.Size([1, 256, 5, 5])
Flatten output shape: torch.Size([1, 6400])
Linear output shape: torch.Size([1, 4096])
ReLU output shape: torch.Size([1, 4096])
Dropout output shape: torch.Size([1, 4096])
Linear output shape: torch.Size([1, 4096])
ReLU output shape: torch.Size([1, 4096])
Dropout output shape: torch.Size([1, 4096])
Linear output shape: torch.Size([1, 10])


8.1.3 Training


Although AlexNet was trained on ImageNet in Krizhevsky et al. (2012), we use Fashion-MNIST
here since training an ImageNet model to convergence could take hours or days
even on a modern GPU. One of the problems with applying AlexNet directly on Fashion-MNIST
is that its images have lower resolution (28 x 28 pixels) than ImageNet images. To
make things work, we upsample them to 224 x 224. This is generally not a smart practice, as
it simply increases the computational complexity without adding information. Nonetheless,
we do it here to be faithful to the AlexNet architecture. We perform this resizing with the
resize argument in the d2l.FashionMNIST constructor.


Now, we can start training AlexNet. Compared to LeNet in Section 7.6, the main change 
here is the use of a smaller learning rate and much slower training due to the deeper and 
wider network, the higher image resolution, and the more costly convolutions. 


model = AlexNet(lr=0.01)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224))
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
trainer.fit(model, data)


(Plot: train_loss, val_loss, and val_acc versus epoch, for epochs 0 to 10.)


8.1.4 Discussion


AlexNet’s structure bears a striking resemblance to LeNet, with a number of critical
improvements, both for accuracy (dropout) and for ease of training (ReLU). What is equally
striking is the amount of progress that has been made in terms of deep learning tooling.
What was several months of work in 2012 can now be accomplished in a dozen lines of
code using any modern framework.


Reviewing the architecture, we see that AlexNet has an Achilles heel when it comes to effi- 
ciency: the last two hidden layers require matrices of size 6400 x 4096 and 4096 x 4096, re- 
spectively. This corresponds to 164 MB of memory and 81 MFLOPs of computation, both 
of which are a nontrivial outlay, especially on smaller devices, such as mobile phones. This 
is one of the reasons why AlexNet has been surpassed by much more effective architectures 
that we will cover in the following sections. Nonetheless, it is a key step from shallow to 
deep networks that are used nowadays. Note that even though the number of parameters 
exceeds by far the amount of training data in our experiments (the last two layers have more 
than 40 million parameters, trained on a dataset of 60 thousand images), there is hardly
any overfitting: training and validation loss are virtually identical throughout training. This 
is due to the improved regularization, such as dropout, inherent in modern deep network 
designs. 
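

The memory figure quoted above can be reproduced directly from the two matrix shapes. This is a minimal sketch that counts weights only (biases are ignored) and assumes 4 bytes per value (FP32), with "MB" read as 2^20 bytes.

```python
BYTES_PER_FP32 = 4

fc1 = 6400 * 4096   # flattened 256 x 5 x 5 features -> 4096 hidden units
fc2 = 4096 * 4096   # 4096 -> 4096 hidden units
params = fc1 + fc2  # weights only, biases ignored

mib = params * BYTES_PER_FP32 / 2**20
print(f"{params:,} weights, {mib:.0f} MiB in FP32")
```

The two hidden layers alone hold roughly 43 million weights, which is where the "more than 40 million parameters" in the text comes from.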


Although it seems that there are only a few more lines in AlexNet’s implementation than 
in LeNet’s, it took the academic community many years to embrace this conceptual change 
and take advantage of its excellent experimental results. This was also due to the lack of 
efficient computational tools. At the time neither DistBelief (Dean et al., 2012) nor Caffe 
(Jia et al., 2014) existed, and Theano (Bergstra et al., 2010) still lacked many distinguishing 
features. It was the availability of TensorFlow (Abadi et al., 2016) that dramatically changed 
the situation. 


8.1.5 Exercises 
1. Following up on the discussion above, analyze the computational properties of AlexNet. 


1. Compute the memory footprint for convolutions and fully connected layers, respec- 
tively. Which one dominates? 


2. Calculate the computational cost for the convolutions and the fully connected layers. 


3. How does the memory (read and write bandwidth, latency, size) affect computation? 
Is there any difference in its effects for training and inference? 


2. You are a chip designer and need to trade off computation and memory bandwidth.
For example, a faster chip requires more power and possibly a larger chip area. More
memory bandwidth requires more pins and control logic, thus also more area. How do
you optimize?


3. Why do engineers no longer report performance benchmarks on AlexNet? 


4. Try increasing the number of epochs when training AlexNet. Compared with LeNet, 
how do the results differ? Why? 


5. AlexNet may be too complex for the Fashion-MNIST dataset, in particular due to the 
low resolution of the initial images. 


1. Try simplifying the model to make the training faster, while ensuring that the accu- 
racy does not drop significantly. 


2. Design a better model that works directly on 28 x 28 images. 


6. Modify the batch size, and observe the changes in throughput (images/s), accuracy, and 
GPU memory. 


7. Apply dropout and ReLU to LeNet-5. Does it improve? Can you improve things further 
by preprocessing to take advantage of the invariances inherent in the images? 


8. Can you make AlexNet overfit? Which feature do you need to remove or change to break 
training? 


Discussions.


8.2 Networks Using Blocks (VGG) 


While AlexNet offered empirical evidence that deep CNNs can achieve good results, it did 
not provide a general template to guide subsequent researchers in designing new networks. 
In the following sections, we will introduce several heuristic concepts commonly used to 
design deep networks. 


Progress in this field mirrors that of VLSI (very large scale integration) in chip design where 
engineers moved from placing transistors to logical elements to logic blocks (Mead, 1980). 
Similarly, the design of neural network architectures has grown progressively more abstract, 
with researchers moving from thinking in terms of individual neurons to whole layers, 
and now to blocks, repeating patterns of layers. A decade later, this has now progressed 
to researchers using entire trained models to repurpose them for different, albeit related, 
tasks. Such large pretrained models are typically called foundation models (Bommasani et 
al., 2021). 


Back to network design. The idea of using blocks first emerged from the Visual Geometry 
Group (VGG) at Oxford University, in their eponymously-named VGG network (Simonyan 
and Zisserman, 2014). It is easy to implement these repeated structures in code with any 
modern deep learning framework by using loops and subroutines.


import torch
from torch import nn
from d2l import torch as d2l


8.2.1 VGG Blocks 


The basic building block of CNNs is a sequence of the following: (i) a convolutional layer 
with padding to maintain the resolution, (ii) a nonlinearity such as a ReLU, (iii) a pooling 
layer such as max-pooling to reduce the resolution. One of the problems with this approach 
is that the spatial resolution decreases quite rapidly. In particular, this imposes a hard limit 
of $\log_2 d$ convolutional layers on the network before all dimensions ($d$) are used up. For
instance, in the case of ImageNet, it would be impossible to have more than 8 convolutional 
layers in this way. 


The key idea of Simonyan and Zisserman (2014) was to use multiple convolutions in be- 
tween downsampling via max-pooling in the form of a block. They were primarily in- 
terested in whether deep or wide networks perform better. For instance, the successive 
application of two 3 x 3 convolutions touches the same pixels as a single 5 x 5 convolution 
does. At the same time, the latter uses approximately as many parameters ($25 \cdot c^2$) as three
3 x 3 convolutions do ($3 \cdot 9 \cdot c^2$). In a rather detailed analysis they showed that deep and
narrow networks significantly outperform their shallow counterparts. This set deep learning
on a quest for ever deeper networks with over 100 layers for typical applications. Stacking 
3 x 3 convolutions has become a gold standard in later deep networks (a design decision 
only to be revisited recently by Liu et al. (2022)). Consequently, fast implementations for 
small convolutions have become a staple on GPUs (Lavin and Gray, 2016). 
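

The comparison between stacked 3 x 3 convolutions and a single 5 x 5 convolution can be made concrete with a short calculation. The sketch below assumes stride 1, equal numbers of input and output channels, and no biases; the helper names and the channel count c = 64 are hypothetical and used only for illustration.

```python
def stacked_params(num_layers, kernel, c):
    """Weights in num_layers stacked kernel x kernel convolutions with c input
    and c output channels each (biases ignored)."""
    return num_layers * kernel * kernel * c * c


def receptive_field(num_layers, kernel):
    """Side length of the input region seen by one output pixel (stride 1)."""
    return num_layers * (kernel - 1) + 1


c = 64  # hypothetical channel count, chosen only for illustration
print("one 5x5:  ", stacked_params(1, 5, c), "weights, field", receptive_field(1, 5))
print("two 3x3:  ", stacked_params(2, 3, c), "weights, field", receptive_field(2, 3))
print("three 3x3:", stacked_params(3, 3, c), "weights, field", receptive_field(3, 3))
```

Two 3 x 3 layers cover the same 5 x 5 region with fewer weights and an extra nonlinearity in between, while three of them cost about as much as one 5 x 5 layer but see a 7 x 7 region.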


Back to VGG: a VGG block consists of a sequence of convolutions with 3 x 3 kernels with
padding of 1 (keeping height and width) followed by a 2 x 2 max-pooling layer with stride
of 2 (halving height and width after each block). In the code below, we define a function
called vgg_block to implement one VGG block.


The function below takes two arguments, corresponding to the number of convolutional
layers num_convs and the number of output channels out_channels.


def vgg_block(num_convs, out_channels):
    layers = []
    for _ in range(num_convs):
        layers.append(nn.LazyConv2d(out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)


8.2.2 VGG Network 


Like AlexNet and LeNet, the VGG Network can be partitioned into two parts: the first
consisting mostly of convolutional and pooling layers and the second consisting of fully
connected layers that are identical to those in AlexNet. The key difference is that the
convolutional layers are grouped in nonlinear transformations that leave the dimensionality
unchanged, followed by a resolution-reduction step, as depicted in Fig. 8.2.1.


Fig. 8.2.1 From AlexNet to VGG. The key difference is that VGG consists of blocks of layers,
whereas AlexNet’s layers are all designed individually.


The convolutional part of the network connects several VGG blocks from Fig. 8.2.1 (also 
defined in the vgg_block function) in succession. This grouping of convolutions is a pat- 
tern that has remained almost unchanged over the past decade, although the specific choice 
of operations has undergone considerable modifications. The variable arch consists of a 
list of tuples (one per block), where each contains two values: the number of convolutional 
layers and the number of output channels, which are precisely the arguments required to 
call the vgg_block function. As such, VGG defines a family of networks rather than just a 
specific manifestation. To build a specific network we simply iterate over arch to compose 
the blocks. 


class VGG(d2l.Classifier):
    def __init__(self, arch, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        conv_blks = []
        for (num_convs, out_channels) in arch:
            conv_blks.append(vgg_block(num_convs, out_channels))
        self.net = nn.Sequential(
            *conv_blks, nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(num_classes))
        self.net.apply(d2l.init_cnn)


The original VGG network had five convolutional blocks, among which the first two have 
one convolutional layer each and the latter three contain two convolutional layers each. The 
first block has 64 output channels and each subsequent block doubles the number of output 
channels, until that number reaches 512. Since this network uses eight convolutional layers 
and three fully connected layers, it is often called VGG-11. 


VGG(arch=((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))).layer_summary(
    (1, 1, 224, 224))


Sequential output shape: torch.Size([1, 64, 112, 112])
Sequential output shape: torch.Size([1, 128, 56, 56])
Sequential output shape: torch.Size([1, 256, 28, 28])
Sequential output shape: torch.Size([1, 512, 14, 14])
Sequential output shape: torch.Size([1, 512, 7, 7])
Flatten output shape: torch.Size([1, 25088])
Linear output shape: torch.Size([1, 4096])
ReLU output shape: torch.Size([1, 4096])
Dropout output shape: torch.Size([1, 4096])
Linear output shape: torch.Size([1, 4096])
ReLU output shape: torch.Size([1, 4096])
Dropout output shape: torch.Size([1, 4096])
Linear output shape: torch.Size([1, 10])


As you can see, we halve height and width at each block, finally reaching a height and width 
of 7 before flattening the representations for processing by the fully connected part of the 
network. Simonyan and Zisserman (2014) described several other variants of VGG. In 
fact, it has become the norm to propose families of networks with different speed-accuracy
trade-off when introducing a new architecture. 
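

The name VGG-11 and the shapes in the summary follow mechanically from the arch tuple. The sketch below uses plain Python rather than the network itself; it assumes a 224 x 224 input and relies on the fact that each block keeps the resolution in its convolutions and halves it once in its pooling layer.

```python
arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))

num_conv_layers = sum(num_convs for num_convs, _ in arch)
num_fc_layers = 3
print("weight layers:", num_conv_layers + num_fc_layers)

size = 224
for num_convs, out_channels in arch:
    size //= 2  # the 3 x 3 convolutions keep the size; 2 x 2 pooling halves it
    print(f"block with {num_convs} conv(s): {out_channels} x {size} x {size}")

# Number of features handed to the first fully connected layer.
flattened = arch[-1][1] * size * size
print("flattened features:", flattened)
```

Changing the entries of arch is all it takes to describe another member of the family, which is what makes the block abstraction convenient.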


8.2.3 Training 


Since VGG-11 is computationally more demanding than AlexNet we construct a network 
with a smaller number of channels. This is more than sufficient for training on Fashion- 
MNIST. The model training process is similar to that of AlexNet in Section 8.1. Again ob- 
serve the close match between validation and training loss, suggesting only a small amount 
of overfitting. 


model = VGG(arch=((1, 16), (1, 32), (2, 64), (2, 128), (2, 128)), lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)


(Plot: train_loss, val_loss, and val_acc versus epoch, for epochs 0 to 10.)


8.2.4 Summary 


One might argue that VGG is the first truly modern convolutional neural network. While
AlexNet introduced many of the components of what makes deep learning effective at scale,
it is VGG that arguably introduced key properties such as blocks of multiple convolutions
and a preference for deep and narrow networks. It is also the first network that is actually
an entire family of similarly parametrized models, giving the practitioner ample trade-off
between complexity and speed. This is also the place where modern deep learning frameworks
shine. It is no longer necessary to generate XML configuration files to specify a
network; instead, we assemble such networks through simple Python code.


More recently ParNet (Goyal et al., 2021) demonstrated that it is possible to achieve com- 
petitive performance using a much more shallow architecture through a large number of 
parallel computations. This is an exciting development and there is hope that it will influ- 
ence architecture designs in the future. For the remainder of the chapter, though, we will 
follow the path of scientific progress over the past decade. 


8.2.5 Exercises 


1. Compared with AlexNet, VGG is much slower in terms of computation, and it also needs 
more GPU memory. 


1. Compare the number of parameters needed for AlexNet and VGG. 


2. Compare the number of floating point operations used in the convolutional layers 
and in the fully connected layers. 


3. How could you reduce the computational cost created by the fully connected layers? 


2. When displaying the dimensions associated with the various layers of the network, we 
only see the information associated with eight blocks (plus some auxiliary transforms), 
even though the network has 11 layers. Where did the remaining three layers go? 


3. Use Table 1 in the VGG paper (Simonyan and Zisserman, 2014) to construct other com- 
mon models, such as VGG-16 or VGG-19. 


4. Upsampling the resolution in Fashion-MNIST eight-fold from 28 x 28 to 224 x 224
dimensions is very wasteful. Try modifying the network architecture and resolution
conversion, e.g., to 56 or to 84 dimensions for its input instead. Can you do so without
reducing the accuracy of the network? Consult the VGG paper (Simonyan and Zisserman,
2014) for ideas on adding more nonlinearities prior to downsampling.


Discussions.


8.3 Network in Network (NiN) 


LeNet, AlexNet, and VGG all share a common design pattern: extract features exploiting 
spatial structure via a sequence of convolutions and pooling layers and post-process the 
representations via fully connected layers. The improvements upon LeNet by AlexNet and 
VGG mainly lie in how these later networks widen and deepen these two modules. 


This design poses two major challenges. First, the fully connected layers at the end of 
the architecture consume tremendous numbers of parameters. For instance, even a simple 
model such as VGG-11 requires a monstrous matrix, occupying almost 400MB of RAM 
in single precision (FP32). This is a significant impediment to computation, in particular 
on mobile and embedded devices. After all, even high-end mobile phones sport no more 
than 8GB of RAM. At the time VGG was invented, this was an order of magnitude less 
(the iPhone 4S had 512MB). As such, it would have been difficult to justify spending the 
majority of memory on an image classifier. 
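

The size of that matrix follows from the VGG-11 shapes of the previous section. A minimal sketch, counting only the weights of the first fully connected layer and assuming 4 bytes per FP32 value:

```python
in_features = 512 * 7 * 7   # flattened output of the last VGG-11 block
out_features = 4096
weights = in_features * out_features

mib = weights * 4 / 2**20   # 4 bytes per FP32 value, reported in units of 2^20 bytes
print(f"{weights:,} weights, {mib:.0f} MiB")
```

A single layer thus accounts for more than 100 million weights, which is the figure behind the "almost 400MB" in the text.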


Second, it is equally impossible to add fully connected layers earlier in the network to 
increase the degree of nonlinearity: doing so would destroy the spatial structure and require 
potentially even more memory. 


The network in network (NiN) blocks (Lin et al., 2013) offer an alternative, capable of 
solving both problems in one simple strategy. They were proposed based on a very simple 
insight: (i) use 1 x 1 convolutions to add local nonlinearities across the channel activations 
and (ii) use global average pooling to integrate across all locations in the last representation 
layer. Note that global average pooling would not be effective, were it not for the added 
nonlinearities. Let’s dive into this in detail. 


import torch
from torch import nn
from d2l import torch as d2l


8.3.1 NiN Blocks 


Recall Section 7.4.3. In it we said that the inputs and outputs of convolutional layers consist 
of four-dimensional tensors with axes corresponding to the example, channel, height, and 
width. Also recall that the inputs and outputs of fully connected layers are typically two- 
dimensional tensors corresponding to the example and feature. The idea behind NiN is 
to apply a fully connected layer at each pixel location (for each height and width). The
resulting 1 x 1 convolution can be thought of as a fully connected layer acting independently
on each pixel location.
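

This equivalence can be verified on a toy example without any deep learning framework. The sketch below implements a naive stride-1 cross-correlation on nested lists and compares it with applying a weight matrix to the channel vector of every pixel. The helper names conv2d and per_pixel_linear and the toy values are illustrative only and are not part of the d2l library.

```python
def conv2d(x, w):
    """Naive stride-1 cross-correlation without padding.
    x: [c_in][h][w] nested lists, w: [c_out][c_in][k][k] nested lists."""
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    k = len(w[0][0])
    out = []
    for wo in w:  # one set of kernels per output channel
        plane = []
        for i in range(h - k + 1):
            row = []
            for j in range(wd - k + 1):
                row.append(sum(wo[c][a][b] * x[c][i + a][j + b]
                               for c in range(c_in)
                               for a in range(k) for b in range(k)))
            plane.append(row)
        out.append(plane)
    return out


def per_pixel_linear(x, m):
    """Apply the matrix m ([c_out][c_in]) to the channel vector of every pixel."""
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    return [[[sum(m[o][c] * x[c][i][j] for c in range(c_in))
              for j in range(wd)] for i in range(h)] for o in range(len(m))]


# Toy input with 3 channels on a 2 x 2 grid, mapped to 2 output channels.
x = [[[1, 2], [3, 4]], [[0, 1], [1, 0]], [[2, 2], [2, 2]]]
m = [[1, 0, -1], [2, 1, 0]]
w = [[[[m[o][c]]] for c in range(3)] for o in range(2)]  # reshape to 1 x 1 kernels

assert conv2d(x, w) == per_pixel_linear(x, m)
print(conv2d(x, w))
```

Because the same matrix is shared across all positions, the 1 x 1 convolution adds channel mixing without touching the spatial layout.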


Fig. 8.3.1 illustrates the main structural differences between VGG and NiN, and their blocks. 
Note both the difference in the NiN blocks (the initial convolution is followed by 1 x 1 con- 
volutions, whereas VGG retains 3 x 3 convolutions) and at the end where we no longer 
require a giant fully connected layer. 


Fig. 8.3.1 Comparing the architectures of VGG and NiN, and of their blocks.


def nin_block(out_channels, kernel_size, strides, padding):
    return nn.Sequential(
        nn.LazyConv2d(out_channels, kernel_size, strides, padding), nn.ReLU(),
        nn.LazyConv2d(out_channels, kernel_size=1), nn.ReLU(),
        nn.LazyConv2d(out_channels, kernel_size=1), nn.ReLU())


8.3.2 NiN Model 


NiN uses the same initial convolution sizes as AlexNet (it was proposed shortly thereafter).
The kernel sizes are 11 x 11, 5 x 5, and 3 x 3, respectively, and the numbers of output
channels match those of AlexNet. Each NiN block is followed by a max-pooling layer with
a stride of 2 and a window shape of 3 x 3.


The second significant difference between NiN and both AlexNet and VGG is that NiN 
avoids fully connected layers altogether. Instead, NiN uses a NiN block with a number of 
output channels equal to the number of label classes, followed by a global average pooling 
layer, yielding a vector of logits. This design significantly reduces the number of required 
model parameters, albeit at the expense of a potential increase in training time. 
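

A rough count illustrates how large the saving is. The sketch below compares the final NiN block (as implemented in this section, on a 384-channel input with 10 classes) with an AlexNet-style fully connected head on a 256 x 5 x 5 representation. The helper names are illustrative, and global average pooling contributes no parameters.

```python
def conv_params(c_in, c_out, k):
    return c_in * c_out * k * k + c_out  # weights plus biases


def linear_params(n_in, n_out):
    return n_in * n_out + n_out  # weights plus biases


num_classes = 10

# Final NiN block on a 384-channel input: one 3 x 3 and two 1 x 1 convolutions.
nin_head = (conv_params(384, num_classes, 3)
            + 2 * conv_params(num_classes, num_classes, 1))

# AlexNet-style head on the flattened 256 x 5 x 5 representation.
fc_head = (linear_params(256 * 5 * 5, 4096)
           + linear_params(4096, 4096)
           + linear_params(4096, num_classes))

print(f"NiN head: {nin_head:,}  fully connected head: {fc_head:,}")
```

The two heads differ by about three orders of magnitude, although the comparison covers only the heads and not the convolutional bodies of the networks.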


class NiN(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nin_block(96, kernel_size=11, strides=4, padding=0),
            nn.MaxPool2d(3, stride=2),
            nin_block(256, kernel_size=5, strides=1, padding=2),
            nn.MaxPool2d(3, stride=2),
            nin_block(384, kernel_size=3, strides=1, padding=1),
            nn.MaxPool2d(3, stride=2),
            nn.Dropout(0.5),
            nin_block(num_classes, kernel_size=3, strides=1, padding=1),
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten())
        self.net.apply(d2l.init_cnn)


We create a data example to see the output shape of each block. 


NiN().layer_summary((1, 1, 224, 224))


Sequential output shape: torch.Size([1, 96, 54, 54])
MaxPool2d output shape: torch.Size([1, 96, 26, 26])
Sequential output shape: torch.Size([1, 256, 26, 26])
MaxPool2d output shape: torch.Size([1, 256, 12, 12])
Sequential output shape: torch.Size([1, 384, 12, 12])
MaxPool2d output shape: torch.Size([1, 384, 5, 5])
Dropout output shape: torch.Size([1, 384, 5, 5])
Sequential output shape: torch.Size([1, 10, 5, 5])
AdaptiveAvgPool2d output shape: torch.Size([1, 10, 1, 1])
Flatten output shape: torch.Size([1, 10])


8.3.3 Training


As before we use Fashion-MNIST to train the model using the same optimizer that we used
for AlexNet and VGG.


model = NiN(lr=0.05)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)


(Plot: train_loss, val_loss, and val_acc versus epoch.)


8.3.4 Summary 


NiN has dramatically fewer parameters than AlexNet and VGG. This stems primarily from 
the fact that it needs no giant fully connected layers. Instead, it uses global average pooling 
to aggregate across all image locations after the last stage of the network body. This obvi- 
ates the need for expensive (learned) reduction operations and replaces them by a simple 
average. What surprised researchers at the time was the fact that this averaging operation 
did not harm accuracy. Note that averaging across a low-resolution representation (with 
many channels) also adds to the amount of translation invariance that the network can han- 
dle. 
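

Global average pooling itself is a one-line operation. The sketch below works on nested lists with layout [channels][height][width]; the helper global_avg_pool and the toy values are illustrative, and the mirrored input merely stands in for "the same content at different positions".

```python
def global_avg_pool(x):
    """Average each channel of x ([channels][height][width]) over all positions."""
    return [sum(sum(row) for row in plane) / (len(plane) * len(plane[0]))
            for plane in x]


# Toy feature map with 2 channels on a 2 x 2 grid.
x = [[[1.0, 3.0], [5.0, 7.0]],
     [[0.0, 0.0], [2.0, 2.0]]]
logits = global_avg_pool(x)

# Rearranging positions within each channel (here: mirroring every row)
# leaves the averages unchanged.
mirrored = [[row[::-1] for row in plane] for plane in x]
assert global_avg_pool(mirrored) == logits
print(logits)
```

Since the average ignores where an activation occurs, the resulting vector of logits depends only on what was detected, not on its location in the final feature map.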


Choosing fewer convolutions with wide kernels and replacing them by 1 x 1 convolutions 
aids the quest for fewer parameters further. It can cater for a significant amount of non- 
linearity across channels within any given location. Both 1 x 1 convolutions and global 
average pooling significantly influenced subsequent CNN designs. 


8.3.5 Exercises 


1. Why are there two 1 x 1 convolutional layers per NiN block? Increase their number to 
three. Reduce their number to one. What changes? 


2. What changes if you replace the 1 x 1 convolutions by 3 x 3 convolutions? 


3. What happens if you replace the global average pooling by a fully connected layer 
(speed, accuracy, number of parameters)? 


4. Calculate the resource usage for NiN. 
1. What is the number of parameters? 
2. What is the amount of computation? 
3. What is the amount of memory needed during training? 
4. What is the amount of memory needed during prediction? 


5. What are possible problems with reducing the 384 x 5 x 5 representation to a 10 x 5 x 5
representation in one step?


6. Use the structural design decisions in VGG that led to VGG-11, VGG-16, and VGG-19 
to design a family of NiN-like networks. 


8.4 Multi-Branch Networks (GoogLeNet) 


In 2014, GoogLeNet won the ImageNet Challenge (Szegedy et al., 2015), using a structure 
that combined the strengths of NiN (Lin et al., 2013), repeated blocks (Simonyan and Zis- 
serman, 2014), and a cocktail of convolution kernels. It was arguably also the first network 
that exhibited a clear distinction among the stem (data ingest), body (data processing), and 
head (prediction) in a CNN. This design pattern has persisted ever since in the design of 
deep networks: the stem is given by the first two or three convolutions that operate on the 
image. They extract low-level features from the underlying images. This is followed by a 
body of convolutional blocks. Finally, the head maps the features obtained so far to the 
required classification, segmentation, detection, or tracking problem at hand. 


The key contribution in GoogLeNet was the design of the network body. It solved the problem
of selecting convolution kernels in an ingenious way. While other works tried to identify
which convolution, ranging from 1 x 1 to 11 x 11, would be best, it simply concatenated
multi-branch convolutions. In what follows we introduce a slightly simplified version of
GoogLeNet: the original design included a number of tricks for stabilizing training through
intermediate loss functions, applied to multiple layers of the network. They are no longer
necessary due to the availability of improved training algorithms.


import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


8.4.1 Inception Blocks 


The basic convolutional block in GoogLeNet is called an Inception block, stemming from 
the meme “we need to go deeper” from the movie Inception. 


Fig. 8.4.1 Structure of the Inception block.


As depicted in Fig. 8.4.1, the Inception block consists of four parallel branches. The first
three branches use convolutional layers with window sizes of 1 x 1, 3 x 3, and 5 x 5 to
extract information from different spatial sizes. The middle two branches also add a 1 x 1
convolution of the input to reduce the number of channels, reducing the model’s complexity.
The fourth branch uses a 3 x 3 max-pooling layer, followed by a 1 x 1 convolutional
layer to change the number of channels. The four branches all use appropriate padding
to give the input and output the same height and width. Finally, the outputs along each
branch are concatenated along the channel dimension and comprise the block’s output. The
commonly-tuned hyperparameters of the Inception block are the number of output channels
per layer, i.e., how to allocate capacity among convolutions of different size.
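
To see why the 1 x 1 convolutions in the middle branches reduce complexity, we can count parameters for one branch. The sketch below is only illustrative: it assumes 192 input channels and a 5 x 5 branch with 32 outputs, optionally preceded by a 1 x 1 reduction to 16 channels. The helper `conv_params` is a hypothetical name.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out


c_in = 192  # assumed number of input channels

# 5 x 5 convolution applied directly to all input channels
direct = conv_params(c_in, 32, 5)

# 1 x 1 reduction to 16 channels, followed by the 5 x 5 convolution
reduced = conv_params(c_in, 16, 1) + conv_params(16, 32, 5)

print(direct, reduced, round(direct / reduced, 2))
```

With these numbers the reduced branch uses about a tenth of the parameters of the direct one.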


class Inception(nn.Module):
    # c1--c4 are the number of output channels for each branch
    def __init__(self, c1, c2, c3, c4, **kwargs):
        super(Inception, self).__init__(**kwargs)
        # Branch 1
        self.b1_1 = nn.LazyConv2d(c1, kernel_size=1)
        # Branch 2
        self.b2_1 = nn.LazyConv2d(c2[0], kernel_size=1)
        self.b2_2 = nn.LazyConv2d(c2[1], kernel_size=3, padding=1)
        # Branch 3
        self.b3_1 = nn.LazyConv2d(c3[0], kernel_size=1)
        self.b3_2 = nn.LazyConv2d(c3[1], kernel_size=5, padding=2)
        # Branch 4
        self.b4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.b4_2 = nn.LazyConv2d(c4, kernel_size=1)

    def forward(self, x):
        b1 = F.relu(self.b1_1(x))
        b2 = F.relu(self.b2_2(F.relu(self.b2_1(x))))
        b3 = F.relu(self.b3_2(F.relu(self.b3_1(x))))
        b4 = F.relu(self.b4_2(self.b4_1(x)))
        # Concatenate the branch outputs along the channel dimension
        return torch.cat((b1, b2, b3, b4), dim=1)


To gain some intuition for why this network works so well, consider the combination of 
the filters. They explore the image in a variety of filter sizes. This means that details at 
different extents can be recognized efficiently by filters of different sizes. At the same time, 
we can allocate different amounts of parameters for different filters. 
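
A quick shape check illustrates how the branches combine. The sketch below repeats the block definition so that it runs on its own. The input size (a batch of 2 with 192 channels at 28 x 28) and the channel allocation are assumptions chosen for illustration.

```python
import torch
from torch import nn
from torch.nn import functional as F


class Inception(nn.Module):
    # c1--c4 are the number of output channels for each branch
    def __init__(self, c1, c2, c3, c4):
        super().__init__()
        self.b1_1 = nn.LazyConv2d(c1, kernel_size=1)
        self.b2_1 = nn.LazyConv2d(c2[0], kernel_size=1)
        self.b2_2 = nn.LazyConv2d(c2[1], kernel_size=3, padding=1)
        self.b3_1 = nn.LazyConv2d(c3[0], kernel_size=1)
        self.b3_2 = nn.LazyConv2d(c3[1], kernel_size=5, padding=2)
        self.b4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.b4_2 = nn.LazyConv2d(c4, kernel_size=1)

    def forward(self, x):
        b1 = F.relu(self.b1_1(x))
        b2 = F.relu(self.b2_2(F.relu(self.b2_1(x))))
        b3 = F.relu(self.b3_2(F.relu(self.b3_1(x))))
        b4 = F.relu(self.b4_2(self.b4_1(x)))
        return torch.cat((b1, b2, b3, b4), dim=1)


block = Inception(64, (96, 128), (16, 32), 32)
x = torch.randn(2, 192, 28, 28)  # assumed input shape
y = block(x)
# Height and width are preserved; channels add up to 64 + 128 + 32 + 32
print(y.shape)
```

The padding in every branch keeps the spatial size unchanged, so only the channel dimension grows through concatenation.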


8.4.2 GoogLeNet Model 


As shown in Fig. 8.4.2, GoogLeNet uses a stack of a total of 9 Inception blocks, arranged
into three groups with max-pooling in between, and global average pooling in its head to
generate its estimates. Max-pooling between Inception blocks reduces the dimensionality.
At its stem, the first module is similar to AlexNet and LeNet.


Fig. 8.4.2 The GoogLeNet architecture.


We can now implement GoogLeNet piece by piece. Let’s begin with the stem. The first
module uses a 64-channel 7 x 7 convolutional layer.


class GoogleNet(d2l.Classifier):
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2, padding=1))


The second module uses two convolutional layers: first, a 64-channel 1 x 1 convolutional 
layer, followed by a 3 x 3 convolutional layer that triples the number of channels. This 
corresponds to the second branch in the Inception block and concludes the design of the 
body. At this point we have 192 channels. 


@d2l.add_to_class(GoogleNet)
def b2(self):
    return nn.Sequential(
        nn.LazyConv2d(64, kernel_size=1), nn.ReLU(),
        nn.LazyConv2d(192, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1))




The third module connects two complete Inception blocks in series. The number of output
channels of the first Inception block is 64 + 128 + 32 + 32 = 256. This amounts to a ratio of
the number of output channels among the four branches of 2 : 4 : 1 : 1. To achieve this, we
first reduce the input dimensions by 1/2 and by 1/12 in the second and third branch respectively
to arrive at 96 = 192/2 and 16 = 192/12 channels respectively.


The number of output channels of the second Inception block is increased to 128 + 192 +
96 + 64 = 480, yielding a ratio of 128 : 192 : 96 : 64 = 4 : 6 : 3 : 2. As before, we need to
reduce the number of intermediate dimensions in the second and third branches. A scale of
1/2 and 1/8 respectively suffices, yielding 128 and 32 channels respectively. This is captured
by the arguments of the following Inception block constructors.
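
The channel bookkeeping for the third module can be double-checked with a few lines of plain Python. This is only a sanity check of the arithmetic. The helper `summarize` is a hypothetical name, and the input widths of 192 and 256 channels are taken from the surrounding description.

```python
from functools import reduce
from math import gcd


def summarize(channels):
    """Return the total number of channels and the reduced ratio."""
    total = sum(channels)
    g = reduce(gcd, channels)
    return total, tuple(c // g for c in channels)


first = summarize((64, 128, 32, 32))     # first Inception block
second = summarize((128, 192, 96, 64))   # second Inception block

# Intermediate widths of the second and third branches
first_reduced = (192 // 2, 192 // 12)    # input has 192 channels
second_reduced = (256 // 2, 256 // 8)    # input has 256 channels

print(first, first_reduced)
print(second, second_reduced)
```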


@d2l.add_to_class(GoogleNet)
def b3(self):
    return nn.Sequential(Inception(64, (96, 128), (16, 32), 32),
                         Inception(128, (128, 192), (32, 96), 64),
                         nn.MaxPool2d(kernel_size=3, stride=2, padding=1))


The fourth module is more complicated. It connects five Inception blocks in series, and
they have 192 + 208 + 48 + 64 = 512, 160 + 224 + 64 + 64 = 512, 128 + 256 + 64 + 64 = 512,
112 + 288 + 64 + 64 = 528, and 256 + 320 + 128 + 128 = 832 output channels, respectively.
The number of channels assigned to these branches is similar to that in the third module:
the second branch with the 3 x 3 convolutional layer outputs the largest number of channels,
followed by the first branch with only the 1 x 1 convolutional layer, the third branch with
the 5 x 5 convolutional layer, and the fourth branch with the 3 x 3 max-pooling layer. The
second and third branches will first reduce the number of channels according to the ratio.
These ratios are slightly different in different Inception blocks.


@d2l.add_to_class(GoogleNet)
def b4(self):
    return nn.Sequential(Inception(192, (96, 208), (16, 48), 64),
                         Inception(160, (112, 224), (24, 64), 64),
                         Inception(128, (128, 256), (24, 64), 64),
                         Inception(112, (144, 288), (32, 64), 64),
                         Inception(256, (160, 320), (32, 128), 128),
                         nn.MaxPool2d(kernel_size=3, stride=2, padding=1))


The fifth module has two Inception blocks with 256 + 320 + 128 + 128 = 832 and 384 + 384 +
128 + 128 = 1024 output channels. The number of channels assigned to each branch follows the
same pattern as in the third and fourth modules, but differs in specific values. Note that
the fifth block is followed by the output layer. This block uses the global average
pooling layer to change the height and width of each channel to 1, just as in NiN. Finally,
we turn the output into a two-dimensional array followed by a fully connected layer whose
number of outputs is the number of label classes.


@d2l.add_to_class(GoogleNet)
def b5(self):
    return nn.Sequential(Inception(256, (160, 320), (32, 128), 128),
                         Inception(384, (192, 384), (48, 128), 128),
                         nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten())


Now that we have defined all blocks b1 through b5, it is just a matter of assembling them
all into a full network.


@d2l.add_to_class(GoogleNet)
def __init__(self, lr=0.1, num_classes=10):
    super(GoogleNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1(), self.b2(), self.b3(), self.b4(),
                             self.b5(), nn.LazyLinear(num_classes))
    self.net.apply(d2l.init_cnn)


The GoogLeNet model is computationally complex. Note the large number of relatively
arbitrary hyperparameters in terms of the number of channels chosen, the number of blocks
prior to dimensionality reduction, the relative partitioning of capacity across channels, etc.
Much of it is due to the fact that at the time when GoogLeNet was introduced, automatic
tools for network definition or design exploration were not yet available. For instance, by
now we take it for granted that a competent deep learning framework is capable of inferring
dimensionalities of input tensors automatically. At the time, many such configurations had
to be specified explicitly by the experimenter, thus often slowing down active experimentation.
Moreover, the tools needed for automatic exploration were still in flux and initial
experiments largely amounted to costly brute-force exploration, genetic algorithms, and
similar strategies.
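
As a small illustration of such automatic shape inference, the lazy layers used throughout this chapter determine their input dimensionality on the first forward pass. The sketch below uses an arbitrary input with 3 channels; the sizes are assumptions for illustration only.

```python
import torch
from torch import nn

# Only the number of output channels is specified up front
conv = nn.LazyConv2d(8, kernel_size=3, padding=1)

x = torch.randn(1, 3, 32, 32)  # assumed input with 3 channels
y = conv(x)                    # the first call infers in_channels from x

print(conv.in_channels, tuple(conv.weight.shape), tuple(y.shape))
```

After the first call the layer behaves like an ordinary convolution whose weight has one slice per inferred input channel.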


For now the only modification we will carry out is to reduce the input height and width
from 224 to 96 to have a reasonable training time on Fashion-MNIST. This simplifies the
computation. Let’s have a look at the changes in the shape of the output between the various
modules.


model = GoogleNet().layer_summary((1, 1, 96, 96)) 


Sequential output shape: torch.Size([1, 64, 24, 24]) 
Sequential output shape: torch.Size([1, 192, 12, 12]) 
Sequential output shape: torch.Size([1, 480, 6, 6]) 
Sequential output shape: torch.Size([1, 832, 3, 3]) 
Sequential output shape: torch.Size([1, 1024]) 

Linear output shape: torch.Size([1, 10]) 
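
The spatial sizes in this summary follow from the usual output-size formula for convolution and pooling, floor((n + 2p - k) / s) + 1. The sketch below traces only the layers that change height and width; the helper name `out_size` is ours, and the 1 x 1 and padded 3 x 3 or 5 x 5 layers are skipped because they preserve the size.

```python
def out_size(n, k, s, p):
    """Output height/width for input size n, kernel k, stride s, padding p."""
    return (n + 2 * p - k) // s + 1


n = 96
sizes = []
n = out_size(n, k=7, s=2, p=3)  # b1: 7 x 7 convolution with stride 2
n = out_size(n, k=3, s=2, p=1)  # b1: 3 x 3 max-pooling with stride 2
sizes.append(n)
for _ in range(3):              # b2, b3, b4 each end with a stride-2 max-pooling
    n = out_size(n, k=3, s=2, p=1)
    sizes.append(n)

print(sizes)
```

The resulting sizes match the heights and widths reported for the first four modules; the fifth module then reduces the remaining 3 x 3 map to a single value per channel through global average pooling.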


8.4.3 Training 


As before, we train our model using the Fashion-MNIST dataset. We transform it to 96 x 96 
pixel resolution before invoking the training procedure. 


model = GoogleNet(lr=0.01)
trainer = d2l.Trainer(max_epochs=12, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)


[Figure: train_loss, val_loss, and val_acc plotted against epoch.]


8.4.4 Discussion 


A key feature of GoogLeNet is that it is actually cheaper to compute than its predecessors 
while simultaneously providing improved accuracy. This marks the beginning of a much 
more deliberate network design that trades off the cost of evaluating a network with a reduc- 
tion in errors. It also marks the beginning of experimentation at a block level with network 
design hyperparameters, even though it was entirely manual at the time. We will revisit this 
topic in Section 8.8 when discussing strategies for network structure exploration. 


Over the following sections we will encounter a number of design choices (e.g., batch nor- 
malization, residual connections, and channel grouping) that allow us to improve networks 
significantly. For now, you can be proud to have implemented what is arguably the first 
truly modern CNN. 


8.4.5 Exercises


1. GoogLeNet was so successful that it went through a number of iterations, progressively
improving speed and accuracy. Try to implement and run some of them. They include
the following:
   1. Add a batch normalization layer (Ioffe and Szegedy, 2015), as described later in
   Section 8.5.
   2. Make adjustments to the Inception block (width, choice and order of convolutions),
   as described in Szegedy et al. (2016).
   3. Use label smoothing for model regularization, as described in Szegedy et al. (2016).
   4. Make further adjustments to the Inception block by adding residual connection
   (Szegedy et al., 2017), as described later in Section 8.6.


2. What is the minimum image size needed for GoogLeNet to work?


3. Can you design a variant of GoogLeNet that works on Fashion-MNIST’s native resolution
of 28 x 28 pixels? How would you need to change the stem, the body, and the head
of the network, if anything at all?


4. Compare the model parameter sizes of AlexNet, VGG, NiN, and GoogLeNet. How do
the latter two network architectures significantly reduce the model parameter size?


5. Compare the amount of computation needed in GoogLeNet and AlexNet. How does this
affect the design of an accelerator chip, e.g., in terms of memory size, memory bandwidth,
cache size, the amount of computation, and the benefit of specialized operations?




8.5 Batch Normalization 


Training deep neural networks is difficult. Getting them to converge in a reasonable amount 
of time can be tricky. In this section, we describe batch normalization, a popular and 
effective technique that consistently accelerates the convergence of deep networks (Ioffe 
and Szegedy, 2015). Together with residual blocks—covered later in Section 8.6—batch 
normalization has made it possible for practitioners to routinely train networks with over 
100 layers. A secondary (serendipitous) benefit of batch normalization lies in its inherent 
regularization. 


import torch
from torch import nn
from d2l import torch as d2l


8.5.1 Training Deep Networks


When working with data, we often preprocess before training. Choices regarding data
preprocessing often make an enormous difference in the final results. Recall our application of
MLPs to predicting house prices (Section 5.7). Our first step when working with real data
was to standardize our input features to have zero mean $\boldsymbol{\mu} = 0$ and unit
variance $\boldsymbol{\Sigma} = \boldsymbol{1}$ across multiple observations (Friedman, 1987),
frequently rescaling the latter so that the diagonal is unity, i.e., $\Sigma_{ii} = 1$. Yet
another strategy is to rescale vectors to unit length, possibly with zero mean per observation.
This can work well, e.g., for spatial sensor data. These preprocessing techniques, and many
others, are beneficial for keeping the estimation problem well controlled. For a review of
feature selection and extraction see the article of Guyon et al. (2008), for example.
Standardizing vectors also has the nice side-effect of constraining the complexity of
functions that act upon them. For instance, the celebrated radius-margin bound
(Vapnik, 1995) in support vector machines and the Perceptron Convergence
Theorem (Novikoff, 1962) rely on inputs of bounded norm.


Intuitively, this standardization plays nicely with our optimizers since it puts the parameters 
a priori on a similar scale. As such, it is only natural to ask whether a corresponding 
normalization step inside a deep network might not be beneficial. While this is not quite 
the reasoning that led to the invention of batch normalization (Ioffe and Szegedy, 2015), 
it is a useful way of understanding it and its cousin, layer normalization (Ba et al., 2016), 
within a unified framework. 


Second, for a typical MLP or CNN, as we train, the variables in intermediate layers (e.g., 
affine transformation outputs in MLP) may take values with widely varying magnitudes: 
whether along the layers from input to output, across units in the same layer, and over 
time due to our updates to the model parameters. The inventors of batch normalization 
postulated informally that this drift in the distribution of such variables could hamper the 
convergence of the network. Intuitively, we might conjecture that if one layer has variable 
activations that are 100 times that of another layer, this might necessitate compensatory 
adjustments in the learning rates. Adaptive solvers such as AdaGrad (Duchi et al., 2011), 
Adam (Kingma and Ba, 2014), Yogi (Zaheer et al., 2018), or Distributed Shampoo (Anil 
et al., 2020) aim to address this from the viewpoint of optimization, e.g., by adding aspects 
of second-order methods. The alternative is to prevent the problem from occurring, simply 
by adaptive normalization. 


Third, deeper networks are complex and tend to be more liable to overfitting. This means 
that regularization becomes more critical. A common technique for regularization is noise 
injection. This has been known for a long time, e.g., with regard to noise injection for 
the inputs (Bishop, 1995). It also forms the basis of dropout in Section 5.6. As it turns 
out, quite serendipitously, batch normalization conveys all three benefits: preprocessing, 
numerical stability, and regularization. 


Batch normalization is applied to individual layers, or optionally, to all of them: In each
training iteration, we first normalize the inputs (of batch normalization) by subtracting their
mean and dividing by their standard deviation, where both are estimated based on the
statistics of the current minibatch. Next, we apply a scale coefficient and an offset to recover
the lost degrees of freedom. It is precisely due to this normalization based on batch statistics
that batch normalization derives its name.


Note that if we tried to apply batch normalization with minibatches of size 1, we would not 
be able to learn anything. That is because after subtracting the means, each hidden unit 
would take value 0. As you might guess, since we are devoting a whole section to batch 
normalization, with large enough minibatches the approach proves effective and stable. 
One takeaway here is that when applying batch normalization, the choice of batch size is 
even more significant than without batch normalization, or at least, suitable calibration is 
needed as we might adjust batch size. 
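
The degenerate case of a minibatch of size 1 is easy to reproduce for a fully connected layer. The following sketch uses plain Python lists and a hypothetical helper `center` that subtracts the per-feature minibatch mean; the numbers are made up for illustration.

```python
def center(batch):
    """Subtract the per-feature minibatch mean from every example."""
    n = len(batch)
    means = [sum(col) / n for col in zip(*batch)]
    return [[x - m for x, m in zip(row, means)] for row in batch]


# With a single example, the mean equals the example and everything vanishes
single = center([[0.5, -2.0, 3.0]])

# With two examples, differences between them survive the centering
pair = center([[0.5, -2.0, 3.0], [1.5, 0.0, 1.0]])

print(single)
print(pair)
```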


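Denote by $\mathcal{B}$ a minibatch and let $\mathbf{x} \in \mathcal{B}$ be an input to batch
normalization (BN). In this case the batch normalization is defined as follows:

$$\mathrm{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B}}{\hat{\boldsymbol{\sigma}}_\mathcal{B}} + \boldsymbol{\beta}. \tag{8.5.1}$$

In (8.5.1), $\hat{\boldsymbol{\mu}}_\mathcal{B}$ is the sample mean and
$\hat{\boldsymbol{\sigma}}_\mathcal{B}$ is the sample standard deviation of the minibatch
$\mathcal{B}$. After applying standardization, the resulting minibatch has zero mean and unit
variance. The choice of unit variance (rather than some other magic number) is arbitrary. We
recover this degree of freedom by including an elementwise scale parameter
$\boldsymbol{\gamma}$ and shift parameter $\boldsymbol{\beta}$ that have the same shape as
$\mathbf{x}$. Both are parameters that need to be learned as part of model training.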


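The variable magnitudes for intermediate layers cannot diverge during training since batch
normalization actively centers and rescales them back to a given mean and size (via
$\hat{\boldsymbol{\mu}}_\mathcal{B}$ and $\hat{\boldsymbol{\sigma}}_\mathcal{B}$). Practical
experience confirms that, as alluded to when discussing feature rescaling, batch normalization
seems to allow for more aggressive learning rates. We calculate
$\hat{\boldsymbol{\mu}}_\mathcal{B}$ and $\hat{\boldsymbol{\sigma}}_\mathcal{B}$ in (8.5.1) as
follows:

$$\hat{\boldsymbol{\mu}}_\mathcal{B} = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \mathbf{x} \quad \text{and} \quad \hat{\boldsymbol{\sigma}}_\mathcal{B}^2 = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} (\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B})^2 + \epsilon. \tag{8.5.2}$$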


Note that we add a small constant $\epsilon > 0$ to the variance estimate to ensure that we
never attempt division by zero, even in cases where the empirical variance estimate might be
very small or vanish. The estimates $\hat{\boldsymbol{\mu}}_\mathcal{B}$ and
$\hat{\boldsymbol{\sigma}}_\mathcal{B}$ counteract the scaling issue by using noisy
estimates of mean and variance. You might think that this noisiness should be a problem.
On the contrary, it is actually beneficial.
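
The two formulas (8.5.1) and (8.5.2) translate almost literally into tensor operations. The following sketch applies them to a synthetic minibatch of 64 examples with 4 features; the data, the seed, and the value of eps are assumptions made for illustration, and the scale and shift are left at 1 and 0.

```python
import torch

torch.manual_seed(0)
X = 5.0 + 3.0 * torch.randn(64, 4)  # synthetic minibatch: 64 examples, 4 features
eps = 1e-5
gamma, beta = torch.ones(4), torch.zeros(4)

mu = X.mean(dim=0)                               # sample mean, as in (8.5.2)
var = ((X - mu) ** 2).mean(dim=0) + eps          # sample variance plus eps, as in (8.5.2)
Y = gamma * (X - mu) / torch.sqrt(var) + beta    # normalization, as in (8.5.1)

print(Y.mean(dim=0))
print(Y.var(dim=0, unbiased=False))
```

Each feature of the output has mean close to 0 and variance close to 1, regardless of the offset and scale of the input.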


This turns out to be a recurring theme in deep learning. For reasons that are not yet
well-characterized theoretically, various sources of noise in optimization often lead to faster
training and less overfitting: this variation appears to act as a form of regularization. Teye
et al. (2018) and Luo et al. (2018) related the properties of batch normalization to Bayesian
priors and penalties, respectively. In particular, this sheds some light on the puzzle of why
batch normalization works best for moderate minibatch sizes in the 50-100 range. This
particular size of minibatch seems to inject just the “right amount” of noise per layer, both
in terms of scale via $\hat{\boldsymbol{\sigma}}$, and in terms of offset via
$\hat{\boldsymbol{\mu}}$: a larger minibatch regularizes less due
to the more stable estimates, whereas tiny minibatches destroy useful signal due to high
variance. Exploring this direction further, considering alternative types of preprocessing
and filtering may yet lead to other effective types of regularization.




Fixing a trained model, you might think that we would prefer using the entire dataset to 
estimate the mean and variance. Once training is complete, why would we want the same 
image to be classified differently, depending on the batch in which it happens to reside? 
During training, such exact calculation is infeasible because the intermediate variables for 
all data examples change every time we update our model. However, once the model is 
trained, we can calculate the means and variances of each layer’s variables based on the 
entire dataset. Indeed this is standard practice for models employing batch normalization; 
thus batch normalization layers function differently in training mode (normalizing by mini- 
batch statistics) than in prediction mode (normalizing by dataset statistics). In this form they 
closely resemble the behavior of dropout regularization of Section 5.6, where noise is only 
injected during training. 


8.5.2 Batch Normalization Layers 


Batch normalization implementations for fully connected layers and convolutional layers 
are slightly different. One key difference between batch normalization and other layers is 
that because the former operates on a full minibatch at a time, we cannot just ignore the 
batch dimension as we did before when introducing other layers. 


Fully Connected Layers 


When applying batch normalization to fully connected layers, Ioffe and Szegedy (2015), in
their original paper, inserted batch normalization after the affine transformation and before
the nonlinear activation function. Later applications experimented with inserting batch
normalization right after activation functions. Denoting the input to the fully connected
layer by $\mathbf{x}$, the affine transformation by $\mathbf{W}\mathbf{x} + \mathbf{b}$ (with
the weight parameter $\mathbf{W}$ and the bias parameter $\mathbf{b}$), and the activation
function by $\phi$, we can express the computation of a batch-normalization-enabled, fully
connected layer output $\mathbf{h}$ as follows:


$$\mathbf{h} = \phi(\mathrm{BN}(\mathbf{W}\mathbf{x} + \mathbf{b})). \tag{8.5.3}$$


Recall that mean and variance are computed on the same minibatch on which the transfor- 
mation is applied. 


Convolutional Layers 


Similarly, with convolutional layers, we can apply batch normalization after the convolution 
but before the nonlinear activation function. The key difference from batch normalization 
in fully connected layers is that we apply the operation on a per-channel basis across all 
locations. This is compatible with our assumption of translation invariance that led to 
convolutions: we assumed that the specific location of a pattern within an image was not 
critical for the purpose of understanding. 


Assume that our minibatches contain $m$ examples and that for each channel, the output
of the convolution has height $p$ and width $q$. For convolutional layers, we carry out each
batch normalization over the $m \cdot p \cdot q$ elements per output channel simultaneously.
Thus, we collect the values over all spatial locations when computing the mean and variance and
consequently apply the same mean and variance within a given channel to normalize the
value at each spatial location. Each channel has its own scale and shift parameters, both of
which are scalars.
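
The per-channel statistics can be checked against the framework's own layer. In the sketch below the tensor sizes (m = 4, three channels, 5 x 5 maps) are arbitrary assumptions; we compute mean and variance over the batch and both spatial dimensions and compare the result with `nn.BatchNorm2d` in training mode.

```python
import torch
from torch import nn

torch.manual_seed(0)
m, c, p, q = 4, 3, 5, 5          # assumed sizes for illustration
X = torch.randn(m, c, p, q)
eps = 1e-5

# One mean and one variance per channel, taken over m * p * q values
mean = X.mean(dim=(0, 2, 3), keepdim=True)                  # shape (1, c, 1, 1)
var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
manual = (X - mean) / torch.sqrt(var + eps)

bn = nn.BatchNorm2d(c, eps=eps)  # scale is initialized to 1, shift to 0
bn.train()
reference = bn(X)

print(mean.shape, torch.allclose(manual, reference, atol=1e-5))
```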


Layer Normalization 


Note that in the context of convolutions the batch normalization is well defined even for 
minibatches of size 1: after all, we have all the locations across an image to average. Con- 
sequently, mean and variance are well defined, even if it is just within a single observation. 
This consideration led Ba et al. (2016) to introduce the notion of layer normalization. It 
works just like a batch norm, only that it is applied to one observation at a time. Conse- 
quently both the offset and the scaling factor are scalars. For an n-dimensional vector x, 
layer norms are given by 


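$$\mathbf{x} \rightarrow \mathrm{LN}(\mathbf{x}) = \frac{\mathbf{x} - \hat{\mu}}{\hat{\sigma}}, \tag{8.5.4}$$

where scaling and offset are applied coefficient-wise and given by

$$\hat{\mu} \stackrel{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^n x_i \quad \text{and} \quad \hat{\sigma}^2 \stackrel{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu})^2 + \epsilon. \tag{8.5.5}$$

As before we add a small offset $\epsilon > 0$ to prevent division by zero. One of the major
benefits of using layer normalization is that it prevents divergence. After all, ignoring
$\epsilon$, the output of the layer normalization is scale independent. That is, we have
$\mathrm{LN}(\mathbf{x}) \approx \mathrm{LN}(\alpha \mathbf{x})$ for any choice of
$\alpha \neq 0$. This becomes an equality for $|\alpha| \to \infty$ (the approximate equality
is due to the offset $\epsilon$ for the variance).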


Another advantage of the layer normalization is that it does not depend on the minibatch 
size. It is also independent of whether we are in training or test regime. In other words, it is 
simply a deterministic transformation that standardizes the activations to a given scale. This 
can be very beneficial in preventing divergence in optimization. We skip further details and 
recommend that interested readers consult the original paper. 
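
The scale independence is simple to verify numerically. The sketch below normalizes a made-up five-dimensional vector by hand, compares the result with `torch.nn.functional.layer_norm`, and then rescales the input by 100; the vector and the value of eps are assumptions for illustration.

```python
import torch
from torch.nn import functional as F

x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
eps = 1e-5

# Statistics are computed within the single observation x
mu = x.mean()
sigma = torch.sqrt(((x - mu) ** 2).mean() + eps)
manual = (x - mu) / sigma

builtin = F.layer_norm(x, normalized_shape=(5,), eps=eps)
scaled = F.layer_norm(100.0 * x, normalized_shape=(5,), eps=eps)

print(manual)
print(torch.allclose(builtin, scaled, atol=1e-4))
```

Multiplying the input by 100 leaves the normalized output essentially unchanged, and no batch dimension was needed at any point.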


Batch Normalization During Prediction 


As we mentioned earlier, batch normalization typically behaves differently in training mode 
than in prediction mode. First, the noise in the sample mean and the sample variance arising 
from estimating each on minibatches is no longer desirable once we have trained the model. 
Second, we might not have the luxury of computing per-batch normalization statistics. For 
example, we might need to apply our model to make one prediction at a time. 


Typically, after training, we use the entire dataset to compute stable estimates of the vari- 
able statistics and then fix them at prediction time. Hence, batch normalization behaves 
differently during training than at test time. Recall that dropout also exhibits this charac- 
teristic. 
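
The difference between the two modes can be observed with the built-in layer. Note an assumption here: rather than a separate pass over the whole dataset, `nn.BatchNorm1d` keeps running estimates that are updated during training (with a default momentum of 0.1) and used in evaluation mode. The data below is synthetic.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)        # running_mean starts at 0, running_var at 1
X = 2.0 + torch.randn(8, 3)   # synthetic minibatch with mean near 2

bn.train()
y_train = bn(X)               # normalized with the statistics of X itself

bn.eval()
with torch.no_grad():
    y_eval = bn(X)            # normalized with the running estimates

print(bn.running_mean)
print(torch.allclose(y_train, y_eval))
```

After a single training step the running mean has moved only a tenth of the way towards the minibatch mean, so the same input is normalized quite differently in the two modes.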


8.5.3 Implementation from Scratch


To see how batch normalization works in practice, we implement one from scratch below.


def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use is_grad_enabled to determine whether we are in training mode
    if not torch.is_grad_enabled():
        # In prediction mode, use mean and variance obtained by moving average
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # When using a fully connected layer, calculate the mean and
            # variance on the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # When using a two-dimensional convolutional layer, calculate the
            # mean and variance on the channel dimension (axis=1). Here we
            # need to maintain the shape of X, so that the broadcasting
            # operation can be carried out later
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, the current mean and variance are used
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the mean and variance using moving average
        moving_mean = (1.0 - momentum) * moving_mean + momentum * mean
        moving_var = (1.0 - momentum) * moving_var + momentum * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean.data, moving_var.data


We can now create a proper BatchNorm layer. Our layer will maintain proper parameters for 
scale gamma and shift beta, both of which will be updated in the course of training. Addi- 
tionally, our layer will maintain moving averages of the means and variances for subsequent 
use during model prediction. 


Putting aside the algorithmic details, note the design pattern underlying our implementation 
of the layer. Typically, we define the mathematics in a separate function, say batch_norm. 
We then integrate this functionality into a custom layer, whose code mostly addresses book- 
keeping matters, such as moving data to the right device context, allocating and initializing 
any required variables, keeping track of moving averages (here for mean and variance), 
and so on. This pattern enables a clean separation of mathematics from boilerplate code. 
Also note that for the sake of convenience we did not worry about automatically inferring 
the input shape here; thus we need to specify the number of features throughout. By now 
all modern deep learning frameworks offer automatic detection of size and shape in the 
high-level batch normalization APIs (in practice we will use this instead). 


class BatchNorm(nn.Module):
    # num_features: the number of outputs for a fully connected layer or the
    # number of output channels for a convolutional layer. num_dims: 2 for a
    # fully connected layer and 4 for a convolutional layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # The scale parameter and the shift parameter (model parameters) are
        # initialized to 1 and 0, respectively
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # The variables that are not model parameters are initialized to 0 and
        # 1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If X is not on the main memory, copy moving_mean and moving_var to
        # the device where X is located
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.1)
        return Y


We used momentum to govern the aggregation over past mean and variance estimates. This 
is somewhat of a misnomer as it has nothing whatsoever to do with the momentum term of 
optimization. Nonetheless, it is the commonly adopted name for this term and in deference 
to API naming convention we use the same variable name in our code. 
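
To see what this aggregation does, consider the update rule in isolation. The sketch below is a simplification that feeds the same made-up minibatch mean of 2.0 into the moving average fifty times, starting from 0; the helper name `update` is ours.

```python
def update(moving, value, momentum=0.1):
    """Exponential moving average step used for the running statistics."""
    return (1.0 - momentum) * moving + momentum * value


moving_mean = 0.0
history = []
for _ in range(50):
    # Pretend that every minibatch reports a mean of exactly 2.0
    moving_mean = update(moving_mean, 2.0)
    history.append(moving_mean)

print(history[0], history[9], history[-1])
```

With a momentum of 0.1 the estimate closes a tenth of the remaining gap at every step, so early minibatches are gradually forgotten while recent ones dominate.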


8.5.4 LeNet with Batch Normalization 


To see how to apply BatchNorm in context, below we apply it to a traditional LeNet model 
(Section 7.6). Recall that batch normalization is applied after the convolutional layers or 
fully connected layers but before the corresponding activation functions. 


class BNLeNetScratch(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5), BatchNorm(6, num_dims=4),
            nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5), BatchNorm(16, num_dims=4),
            nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(), nn.LazyLinear(120),
            BatchNorm(120, num_dims=2), nn.Sigmoid(), nn.LazyLinear(84),
            BatchNorm(84, num_dims=2), nn.Sigmoid(),
            nn.LazyLinear(num_classes))


As before, we will train our network on the Fashion-MNIST dataset. This code is virtually
identical to that when we first trained LeNet.


trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128)
model = BNLeNetScratch(lr=0.1)
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)


[Plot: train_loss, val_loss, and val_acc over the training epochs.]


Let’s have a look at the scale parameter gamma and the shift parameter beta learned from 
the first batch normalization layer. 


model.net[1].gamma.reshape((-1,)), model.net[1].beta.reshape((-1,))


(tensor([1.4334, 1.9905, 1.8584, 2.0740, 2.0522, 1.8877], device='cuda:0',
        grad_fn=<ViewBackward0>),
 tensor([ 0.7354, -1.3538, -0.2567, -0.9991, -0.3028,  1.3125], device='cuda:0',
        grad_fn=<ViewBackward0>))


8.5.5 Concise Implementation 


Compared with the BatchNorm class, which we just defined ourselves, we can use the 
BatchNorm class defined in high-level APIs from the deep learning framework directly. 
The code looks virtually identical to our implementation above, except that we no longer 
need to provide additional arguments for it to get the dimensions right. 


class BNLeNet(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5), nn.LazyBatchNorm2d(),
            nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5), nn.LazyBatchNorm2d(),
            nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(), nn.LazyLinear(120), nn.LazyBatchNorm1d(),
            nn.Sigmoid(), nn.LazyLinear(84), nn.LazyBatchNorm1d(),
            nn.Sigmoid(), nn.LazyLinear(num_classes))


Below, we use the same hyperparameters to train our model. Note that as usual, the high- 
level API variant runs much faster because its code has been compiled to C++ or CUDA 
while our custom implementation must be interpreted by Python. 


trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128)
model = BNLeNet(lr=0.1)
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)


[Plot: train_loss, val_loss, and val_acc versus epoch (0 to 10).]


8.5.6 Discussion 


Intuitively, batch normalization is thought to make the optimization landscape smoother. 
However, we must be careful to distinguish between speculative intuitions and true expla- 
nations for the phenomena that we observe when training deep models. Recall that we do 
not even know why simpler deep neural networks (MLPs and conventional CNNs) general- 
ize well in the first place. Even with dropout and weight decay, they remain so flexible that 
their ability to generalize to unseen data likely needs significantly more refined learning- 
theoretic generalization guarantees. 


The original paper proposing batch normalization (Ioffe and Szegedy, 2015), in addition to 
introducing a powerful and useful tool, offered an explanation for why it works: by reducing 
internal covariate shift. Presumably by internal covariate shift they meant something like 
the intuition expressed above—the notion that the distribution of variable values changes 
over the course of training. However, there were two problems with this explanation: i) This 
drift is very different from covariate shift, rendering the name a misnomer. If anything, it 
is closer to concept drift. ii) The explanation offers an under-specified intuition but leaves
open the question of precisely why this technique works, which still wants a rigorous
explanation. Throughout this book, we aim to convey the intuitions that practitioners use to
guide their development of deep neural networks. However, we believe that it is important
to separate these guiding intuitions from established scientific fact. Eventually, when you
master this material and start writing your own research papers you will want to clearly
delineate between technical claims and hunches.


Following the success of batch normalization, its explanation in terms of internal covariate 
shift has repeatedly surfaced in debates in the technical literature and broader discourse 
about how to present machine learning research. In a memorable speech given while ac- 
cepting a Test of Time Award at the 2017 NeurIPS conference, Ali Rahimi used internal 
covariate shift as a focal point in an argument likening the modern practice of deep learning 
to alchemy. Subsequently, the example was revisited in detail in a position paper outlining 
troubling trends in machine learning (Lipton and Steinhardt, 2018). Other authors have 
proposed alternative explanations for the success of batch normalization, some (Santurkar
et al., 2018) claiming that batch normalization’s success comes despite exhibiting behavior
that is in some ways opposite to that claimed in the original paper.


We note that the internal covariate shift is no more worthy of criticism than any of thou- 
sands of similarly vague claims made every year in the technical machine learning literature. 
Likely, its resonance as a focal point of these debates owes to its broad recognizability for 
the target audience. Batch normalization has proven an indispensable method, applied in 
nearly all deployed image classifiers, earning the paper that introduced the technique tens of 
thousands of citations. We conjecture, though, that the guiding principles of regularization 
through noise injection, acceleration through rescaling and lastly preprocessing may well 
lead to further inventions of layers and techniques in the future. 


On a more practical note, there are a number of aspects worth remembering about batch 
normalization: 


• During model training, batch normalization continuously adjusts the intermediate output
  of the network by utilizing the mean and standard deviation of the minibatch, so that
  the values of the intermediate output in each layer throughout the neural network are
  more stable.

• Batch normalization is slightly different for fully connected layers than for convolutional
  layers. In fact, for convolutional layers, layer normalization can sometimes be used as
  an alternative.

• Like a dropout layer, batch normalization layers have different behaviors in training mode
  than in prediction mode.

• Batch normalization is useful for regularization and improving convergence in
  optimization. By contrast, the original motivation of reducing internal covariate shift seems
  not to be a valid explanation.

• For more robust models that are less sensitive to input perturbations, consider removing
  batch normalization (Wang et al., 2022).
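The difference between training mode and prediction mode noted in the list above can be checked with a short sketch. It uses PyTorch's built-in nn.BatchNorm1d rather than the BatchNorm class defined earlier, and the input scale and shift are arbitrary illustrative choices.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
X = torch.randn(8, 4) * 5 + 3  # arbitrary scale and shift for illustration

bn.train()
Y_train = bn(X)  # normalizes with minibatch statistics and updates running stats

bn.eval()
Y_eval = bn(X)   # normalizes with the stored running statistics instead

print(Y_train.mean(dim=0))  # close to zero in every feature
print(bn.running_mean)      # one momentum step away from its zero initialization
```

In training mode each feature of the output has roughly zero mean over the minibatch. In prediction mode the same input yields different values because the running estimates have only moved one momentum step toward the batch statistics.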


8.5.7 Exercises 


1. Should we remove the bias parameter from the fully connected layer or the convolutional
   layer before the batch normalization? Why?

2. Compare the learning rates for LeNet with and without batch normalization.

   1. Plot the increase in validation accuracy.

   2. How large can you make the learning rate before the optimization fails in both cases?

3. Do we need batch normalization in every layer? Experiment with it.

4. Implement a “lite” version of batch normalization that only removes the mean, or
   alternatively one that only removes the variance. How does it behave?

5. Fix the parameters beta and gamma. Observe and analyze the results.

6. Can you replace dropout by batch normalization? How does the behavior change?

7. Research ideas: think of other normalization transforms that you can apply:

   1. Can you apply the probability integral transform?

   2. Can you use a full-rank covariance estimate? Why should you probably not do that?

   3. Can you use other compact matrix variants (block-diagonal, low-displacement rank,
      Monarch, etc.)?

   4. Does a sparsification compression act as a regularizer?

   5. Are there other projections (e.g., convex cone, symmetry group-specific transforms)
      that you can use?




8.6 Residual Networks (ResNet) and ResNeXt 


As we design ever deeper networks it becomes imperative to understand how adding layers 
can increase the complexity and expressiveness of the network. Even more important is 
the ability to design networks where adding layers makes networks strictly more expressive 
rather than just different. To make some progress we need a bit of mathematics. 


import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


8.6.1 Function Classes 


Consider $\mathcal{F}$, the class of functions that a specific network architecture (together
with learning rates and other hyperparameter settings) can reach. That is, for all
$f \in \mathcal{F}$ there exists some set of parameters (e.g., weights and biases) that can be
obtained through training on a suitable dataset. Let’s assume that $f^*$ is the “truth”
function that we really would like to find. If it is in $\mathcal{F}$, we are in good shape but
typically we will not be quite so lucky. Instead, we will try to find some $f^*_\mathcal{F}$
which is our best bet within $\mathcal{F}$. For instance, given a dataset with features
$\mathbf{X}$ and labels $\mathbf{y}$, we might try finding it by solving the following
optimization problem:


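$$f^*_\mathcal{F} \stackrel{\textrm{def}}{=} \mathop{\mathrm{argmin}}_f L(\mathbf{X}, \mathbf{y}, f) \textrm{ subject to } f \in \mathcal{F}. \qquad (8.6.1)$$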


We know that regularization (Morozov, 1984, Tikhonov and Arsenin, 1977) may control
complexity of $\mathcal{F}$ and achieve consistency, so a larger size of training data generally
leads to better $f^*_\mathcal{F}$. It is only reasonable to assume that if we design a different
and more powerful architecture $\mathcal{F}'$ we should arrive at a better outcome. In other
words, we would expect that $f^*_{\mathcal{F}'}$ is “better” than $f^*_{\mathcal{F}}$. However,
if $\mathcal{F} \not\subseteq \mathcal{F}'$ there is no guarantee that this should even happen.
In fact, $f^*_{\mathcal{F}'}$ might well be worse. As illustrated by Fig. 8.6.1, for non-nested
function classes, a larger function class does not always move closer to the “truth” function
$f^*$. For instance, on the left of Fig. 8.6.1, though $\mathcal{F}_3$ is closer to $f^*$ than
$\mathcal{F}_1$, $\mathcal{F}_6$ moves away and there is no guarantee that further increasing
the complexity can reduce the distance from $f^*$. With nested function classes where
$\mathcal{F}_1 \subseteq \cdots \subseteq \mathcal{F}_6$ on the right of Fig. 8.6.1, we can
avoid the aforementioned issue from the non-nested function classes.


[Figure panels: non-nested function classes (left), nested function classes (right).]


For non-nested function classes, a larger (indicated by area) function class does not 
guarantee we will get closer to the “truth” function (f*). This does not happen in nested 
function classes. 


Thus, only if larger function classes contain the smaller ones are we guaranteed that increas- 
ing them strictly increases the expressive power of the network. For deep neural networks, 
if we can train the newly-added layer into an identity function f(x) = x, the new model 
will be as effective as the original model. As the new model may get a better solution to fit 
the training dataset, the added layer might make it easier to reduce training errors. 


This is the question that He et al. (2016) considered when working on very deep com- 
puter vision models. At the heart of their proposed residual network (ResNet) is the idea 
that every additional layer should more easily contain the identity function as one of its 
elements. These considerations are rather profound but they led to a surprisingly simple 
solution, a residual block. With it, ResNet won the ImageNet Large Scale Visual Recogni- 
tion Challenge in 2015. The design had a profound influence on how to build deep neural 
networks. For instance, residual blocks have been added to recurrent networks (Kim et al.,
2017, Prakash et al., 2016). Likewise, Transformers (Vaswani et al., 2017) use them to
stack many layers of networks efficiently. They are also used in graph neural networks (Kipf
and Welling, 2016) and, as a basic concept, they have been used extensively in computer vision
(Redmon and Farhadi, 2018, Ren et al., 2015). Note that residual networks are predated by
highway networks (Srivastava et al., 2015) that share some of the motivation, albeit without
the elegant parametrization around the identity function.


8.6.2 Residual Blocks 


Let’s focus on a local part of a neural network, as depicted in Fig. 8.6.2. Denote the input 
by x. We assume that f(x), the desired underlying mapping we want to obtain by learning, 
is to be used as input to the activation function on the top. On the left, the portion within the 
dotted-line box must directly learn f(x). On the right, the portion within the dotted-line 
box needs to learn the residual mapping g(x) = f(x) — x, which is how the residual block 
derives its name. If the identity mapping f(x) = x is the desired underlying mapping, the 
residual mapping amounts to g(x) = 0 and it is thus easier to learn: we only need to push the 
weights and biases of the upper weight layer (e.g., fully connected layer and convolutional 
layer) within the dotted-line box to zero. The right figure illustrates the residual block of 
ResNet, where the solid line carrying the layer input x to the addition operator is called 
a residual connection (or shortcut connection). With residual blocks, inputs can forward 
propagate faster through the residual connections across layers. In fact, the residual block 
can be thought of as a special case of the multi-branch Inception block: it has two branches 
one of which is the identity mapping. 
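The claim that the identity mapping is easy to reach can be illustrated with a toy example. The sketch below is not the ResNet block itself: TinyResidual is a hypothetical module whose residual branch g is a single fully connected layer, chosen only to keep the example short.

```python
import torch
from torch import nn

class TinyResidual(nn.Module):
    """Hypothetical minimal block computing f(x) = g(x) + x with a linear g."""
    def __init__(self, dim):
        super().__init__()
        self.g = nn.Linear(dim, dim)

    def forward(self, x):
        return self.g(x) + x

blk = TinyResidual(4)
# Push the weights and biases of the residual branch to zero, so g(x) = 0
nn.init.zeros_(blk.g.weight)
nn.init.zeros_(blk.g.bias)

x = torch.randn(2, 4)
out = blk(x)
print(torch.equal(out, x))
```

Once the residual branch outputs zero, the block passes its input through unchanged. A plain block without the shortcut would instead have to learn an identity weight matrix.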


In a regular block (left), the portion within the dotted-line box must directly learn the
mapping f(x). In a residual block (right), the portion within the dotted-line box needs to
learn the residual mapping g(x) = f(x) − x, making the identity mapping f(x) = x easier
to learn. The residual block outputs f(x) = g(x) + x.


ResNet has VGG’s full $3 \times 3$ convolutional layer design. The residual block has two
$3 \times 3$ convolutional layers with the same number of output channels. Each convolutional
layer is followed by a batch normalization layer and a ReLU activation function. Then, we skip
these two convolution operations and add the input directly before the final ReLU activation
function. This kind of design requires that the output of the two convolutional layers has to
be of the same shape as the input, so that they can be added together. If we want to change
the number of channels, we need to introduce an additional $1 \times 1$ convolutional layer to
transform the input into the desired shape for the addition operation. Let’s have a look at
the code below.


class Residual(nn.Module):  #@save
    """The Residual block of ResNet models."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1,
                                   stride=strides)
        self.conv2 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1,
                                       stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        Y += X
        return F.relu(Y)


This code generates two types of networks: one where we add the input to the output before 
applying the ReLU nonlinearity whenever use_1x1conv=False; and one where we adjust 
channels and resolution by means of a 1 x 1 convolution before adding. Fig. 8.6.3 illustrates 
this. 


ResNet block with and without $1 \times 1$ convolution, which transforms the input into the
desired shape for the addition operation.


Now let’s look at a situation where the input and output are of the same shape, where
$1 \times 1$ convolution is not needed.


blk = Residual(3)
X = torch.randn(4, 3, 6, 6)
blk(X).shape


torch.Size([4, 3, 6, 6]) 


We also have the option to halve the output height and width while increasing the number 
of output channels. In this case we use 1 x 1 convolutions via use_1x1conv=True. This 
comes in handy at the beginning of each ResNet block to reduce the spatial dimensionality 
via strides=2. 


blk = Residual(6, use_1x1conv=True, strides=2)
blk(X).shape


torch.Size([4, 6, 3, 3]) 


8.6.3 ResNet Model 


The first two layers of ResNet are the same as those of the GoogLeNet we described before: 
the 7 x 7 convolutional layer with 64 output channels and a stride of 2 is followed by the 
3 x 3 max-pooling layer with a stride of 2. The difference is the batch normalization layer 
added after each convolutional layer in ResNet. 


class ResNet(d2l.Classifier):
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))


GoogLeNet uses four modules made up of Inception blocks. However, ResNet uses four 
modules made up of residual blocks, each of which uses several residual blocks with the 
same number of output channels. The number of channels in the first module is the same 
as the number of input channels. Since a max-pooling layer with a stride of 2 has already 
been used, it is not necessary to reduce the height and width. In the first residual block for 
each of the subsequent modules, the number of channels is doubled compared with that of 
the previous module, and the height and width are halved. 


@d2l.add_to_class(ResNet)
def block(self, num_residuals, num_channels, first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(Residual(num_channels, use_1x1conv=True, strides=2))
        else:
            blk.append(Residual(num_channels))
    return nn.Sequential(*blk)


Then, we add all the modules to ResNet. Here, two residual blocks are used for each mod- 
ule. Lastly, just like GoogLeNet, we add a global average pooling layer, followed by the 
fully connected layer output. 


@d2l.add_to_class(ResNet)
def __init__(self, arch, lr=0.1, num_classes=10):
    super(ResNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1())
    for i, b in enumerate(arch):
        self.net.add_module(f'b{i+2}', self.block(*b, first_block=(i==0)))
    self.net.add_module('last', nn.Sequential(
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.LazyLinear(num_classes)))
    self.net.apply(d2l.init_cnn)


There are four convolutional layers in each module (excluding the 1 x 1 convolutional layer). 
Together with the first 7 x 7 convolutional layer and the final fully connected layer, there are 
18 layers in total. Therefore, this model is commonly known as ResNet-18. By configuring 
different numbers of channels and residual blocks in the module, we can create different 
ResNet models, such as the deeper 152-layer ResNet-152. Although the main architecture 
of ResNet is similar to that of GoogLeNet, ResNet’s structure is simpler and easier to mod- 
ify. All these factors have resulted in the rapid and widespread use of ResNet. Fig. 8.6.4 
depicts the full ResNet-18. 
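The layer count above is simple arithmetic, which the following sketch spells out. The helper count_weight_layers is hypothetical; it assumes two $3 \times 3$ convolutions per residual block and ignores the $1 \times 1$ shortcut convolutions, as the text does.

```python
def count_weight_layers(arch, convs_per_residual=2):
    """Hypothetical helper: stem conv + convs in residual blocks + final linear layer."""
    stem = 1  # the first 7x7 convolutional layer
    body = sum(num_residuals * convs_per_residual for num_residuals, _ in arch)
    head = 1  # the final fully connected layer
    return stem + body + head

# (number of residual blocks, number of channels) per module, as in ResNet-18
arch18 = ((2, 64), (2, 128), (2, 256), (2, 512))
depth = count_weight_layers(arch18)
print(depth)
```

Changing the number of residual blocks per module changes the depth accordingly. For example, the block counts (3, 4, 6, 3) give the 34-layer variant under the same counting rule.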


The ResNet-18 architecture.


Before training ResNet, let’s observe how the input shape changes across different modules 
in ResNet. As in all the previous architectures, the resolution decreases while the number 
of channels increases up until the point where a global average pooling layer aggregates all 
features. 


class ResNet18(ResNet):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__(((2, 64), (2, 128), (2, 256), (2, 512)),
                         lr, num_classes)

ResNet18().layer_summary((1, 1, 96, 96))

Sequential output shape: torch.Size([1, 64, 24, 24]) 
Sequential output shape: torch.Size([1, 64, 24, 24]) 
Sequential output shape: torch.Size([1, 128, 12, 12]) 
Sequential output shape: torch.Size([1, 256, 6, 6]) 
Sequential output shape: torch.Size([1, 512, 3, 3]) 
Sequential output shape: torch.Size([1, 10]) 
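The heights and widths in this summary can be reproduced by hand with the standard output-size formula for convolution and pooling layers. The sketch below is plain Python; conv_out_size is a hypothetical helper, and the kernel, stride, and padding values are the ones used in the ResNet code above.

```python
def conv_out_size(n, kernel_size, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer on an n x n input."""
    return (n + 2 * padding - kernel_size) // stride + 1

n = 96
n = conv_out_size(n, kernel_size=7, stride=2, padding=3)  # 7x7 convolution in b1
n = conv_out_size(n, kernel_size=3, stride=2, padding=1)  # 3x3 max-pooling in b1
sizes = [n]
# The first module keeps the resolution; each later module halves it once
for stride in (1, 2, 2, 2):
    n = conv_out_size(n, kernel_size=3, stride=stride, padding=1)
    sizes.append(n)
print(sizes)
```

Only the first residual block of modules b3 to b5 uses a stride of 2, so each of those modules halves the resolution exactly once.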


8.6.4 Training 


We train ResNet on the Fashion-MNIST dataset, just like before. ResNet is quite a pow- 
erful and flexible architecture. The plot capturing training and validation loss illustrates a 
significant gap between both graphs, with the training loss being considerably lower. For 
a network of this flexibility, more training data would offer distinct benefit in closing the 
gap and improving accuracy. 


model = ResNet18(lr=0.01)
trainer = d2l.Trainer(max_epochs=12, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)


[Plot: training and validation curves for ResNet-18 on Fashion-MNIST.]


8.6.5 ResNeXt


One of the challenges one encounters in the design of ResNet is the trade-off between
nonlinearity and dimensionality within a given block. That is, we could add more nonlinearity
by increasing the number of layers, or by increasing the width of the convolutions. An
alternative strategy is to increase the number of channels that can carry information between
blocks. Unfortunately, the latter comes with a quadratic penalty since the computational
cost of ingesting $c_\textrm{i}$ channels and emitting $c_\textrm{o}$ channels is proportional
to $\mathcal{O}(c_\textrm{i} \cdot c_\textrm{o})$ (see our discussion in Section 7.4).


We can take some inspiration from the Inception block of Fig. 8.4.1 which has information
flowing through the block in separate groups. Applying the idea of multiple independent
groups to the ResNet block of Fig. 8.6.3 led to the design of ResNeXt (Xie et al., 2017).
Different from the smorgasbord of transformations in Inception, ResNeXt adopts
the same transformation in all branches, thus minimizing the need for manual tuning of
each branch.


[Figure: the ResNeXt block and its simplified diagram, with $c$ input and output channels,
$b$ intermediate channels, and $g$ groups of $b/g$ channels each.]


The ResNeXt block. The use of grouped convolution with $g$ groups is $g$ times faster than
a dense convolution. It is a bottleneck residual block when the number of intermediate
channels $b$ is less than $c$.


Breaking up a convolution from $c_\textrm{i}$ to $c_\textrm{o}$ channels into one of $g$ groups
of size $c_\textrm{i}/g$ generating $g$ outputs of size $c_\textrm{o}/g$ is called, quite
fittingly, a grouped convolution. The computational cost (proportionally) is reduced from
$\mathcal{O}(c_\textrm{i} \cdot c_\textrm{o})$ to
$\mathcal{O}(g \cdot (c_\textrm{i}/g) \cdot (c_\textrm{o}/g)) = \mathcal{O}(c_\textrm{i} \cdot c_\textrm{o} / g)$,
i.e., it is $g$ times faster. Even better, the number of parameters needed to generate the output
is also reduced from a $c_\textrm{i} \times c_\textrm{o}$ matrix to $g$ smaller matrices of size
$(c_\textrm{i}/g) \times (c_\textrm{o}/g)$, again a $g$ times reduction. In what follows we
assume that both $c_\textrm{i}$ and $c_\textrm{o}$ are divisible by $g$.
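The $g$ times reduction in parameters is easy to verify numerically. The sketch below compares a dense and a grouped convolution built with PyTorch's nn.Conv2d and its groups argument. The channel counts, kernel size, and number of groups are arbitrary illustrative values, and biases are disabled so that only the kernel weights are counted.

```python
import torch
from torch import nn

c_i, c_o, g, k = 32, 64, 4, 3  # illustrative values; both c_i and c_o divisible by g
dense = nn.Conv2d(c_i, c_o, kernel_size=k, padding=1, bias=False)
grouped = nn.Conv2d(c_i, c_o, kernel_size=k, padding=1, groups=g, bias=False)

n_dense = sum(p.numel() for p in dense.parameters())
n_grouped = sum(p.numel() for p in grouped.parameters())
print(n_dense, n_grouped)

# Both layers map the same input shape to the same output shape
X = torch.randn(2, c_i, 8, 8)
print(dense(X).shape, grouped(X).shape)
```

The grouped layer stores $g$ kernels of shape $(c_\textrm{o}/g) \times (c_\textrm{i}/g) \times k \times k$, which is where the factor of $g$ comes from.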


The only challenge in this design is that no information is exchanged between the $g$ groups.
The ResNeXt block of Fig. 8.6.5 amends this in two ways: the grouped convolution with
a $3 \times 3$ kernel is sandwiched in between two $1 \times 1$ convolutions. The second one
serves double duty in changing the number of channels back. The benefit is that we only pay the
$\mathcal{O}(c \cdot b)$ cost for $1 \times 1$ kernels and can make do with an
$\mathcal{O}(b^2 / g)$ cost for $3 \times 3$ kernels. Similar to the residual block
implementation in Section 8.6.2, the residual connection is replaced (thus generalized) by a
$1 \times 1$ convolution.


The right-hand figure in Fig. 8.6.5 provides a much more concise summary of the resulting 
network block. It will also play a major role in the design of generic modern CNNs in 
Section 8.8. Note that the idea of grouped convolutions dates back to the implementation 
of AlexNet (Krizhevsky et al., 2012). When distributing the network across two GPUs 
with limited memory, the implementation treated each GPU as its own channel with no ill 
effects. 


The following implementation of the ResNeXtBlock class takes as argument groups ($g$),
with bot_channels ($b$) intermediate (bottleneck) channels. Lastly, when we need to reduce
the height and width of the representation, we add a stride of 2 by setting use_1x1conv=True,
strides=2.


class ResNeXtBlock(nn.Module):  #@save
    """The ResNeXt block."""
    def __init__(self, num_channels, groups, bot_mul, use_1x1conv=False,
                 strides=1):
        super().__init__()
        bot_channels = int(round(num_channels * bot_mul))
        self.conv1 = nn.LazyConv2d(bot_channels, kernel_size=1, stride=1)
        self.conv2 = nn.LazyConv2d(bot_channels, kernel_size=3,
                                   stride=strides, padding=1,
                                   groups=bot_channels//groups)
        self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1, stride=1)
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()
        self.bn3 = nn.LazyBatchNorm2d()
        if use_1x1conv:
            self.conv4 = nn.LazyConv2d(num_channels, kernel_size=1,
                                       stride=strides)
            self.bn4 = nn.LazyBatchNorm2d()
        else:
            self.conv4 = None

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = F.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.conv4:
            X = self.bn4(self.conv4(X))
        return F.relu(Y + X)


Its use is entirely analogous to that of the Residual block discussed previously. For instance,
when using (use_1x1conv=False, strides=1), the input and output are of the same
shape. Alternatively, setting use_1x1conv=True, strides=2 halves the output height
and width.


blk = ResNeXtBlock(32, 16, 1)
X = torch.randn(4, 32, 96, 96)
blk(X).shape


torch.Size([4, 32, 96, 96]) 


8.6.6 Summary and Discussion 


Nested function classes are desirable since they allow us to obtain strictly more powerful
rather than also subtly different function classes when adding capacity. One way of
accomplishing this is by letting additional layers simply pass through the input to the
output. Residual connections allow for this. As a consequence, this changes the inductive
bias from simple functions being of the form f(x) = 0 to simple functions looking like
f(x) = x.


The residual mapping can learn the identity function more easily, for example by pushing
parameters in the weight layer to zero. We can train an effective deep neural network by having
residual blocks. Inputs can forward propagate faster through the residual connections across
layers. As a consequence, we can train much deeper networks. For instance, the original
ResNet paper (He et al., 2016) allowed for up to 152 layers. Another benefit of residual
networks is that they allow us to add layers, initialized as the identity function, during the
training process. After all, the default behavior of a layer is to let the data pass through
unchanged. This can accelerate the training of very large networks in some cases.


Prior to residual connections, bypassing paths with gating units were introduced to effec- 
tively train highway networks with over 100 layers (Srivastava et al., 2015). Using identity 
functions as bypassing paths, ResNet performed remarkably well on multiple computer vi- 
sion tasks. Residual connections had a major influence on the design of subsequent deep 
neural networks, of either convolutional or sequential nature. As we will introduce later, 
the Transformer architecture (Vaswani et al., 2017) adopts residual connections (together 
with other design choices) and is pervasive in areas as diverse as language, vision, speech, 
and reinforcement learning. 


ResNeXt is an example of how the design of convolutional neural networks has evolved
over time: by being more frugal with computation and trading it off against the size of the
activations (number of channels), it allows for faster and more accurate networks at lower
cost. An alternative way of viewing grouped convolutions is to think of a block-diagonal
matrix for the convolutional weights. Note that there are quite a few such “tricks” that lead
to more efficient networks. For instance, ShiftNet (Wu et al., 2018) mimics the effects of
a $3 \times 3$ convolution, simply by adding shifted activations to the channels, offering
increased function complexity, this time without any computational cost.
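The block-diagonal view of grouped convolutions can be made concrete for the $1 \times 1$ case, where a convolution is just a matrix applied across channels at every pixel. The sketch below is an illustration under that assumption, with small arbitrary channel counts. It assembles the per-group kernels of an nn.Conv2d into one matrix with torch.block_diag and checks that both give the same output.

```python
import torch
from torch import nn

torch.manual_seed(0)
c_i, c_o, g = 4, 6, 2  # illustrative sizes
conv = nn.Conv2d(c_i, c_o, kernel_size=1, groups=g, bias=False)

# conv.weight has shape (c_o, c_i // g, 1, 1): one (c_o/g) x (c_i/g) block per group
blocks = [w[:, :, 0, 0] for w in conv.weight.chunk(g, dim=0)]
W = torch.block_diag(*blocks)  # shape (c_o, c_i), zero outside the diagonal blocks

X = torch.randn(2, c_i, 5, 5)
Y_conv = conv(X)
# Apply the block-diagonal matrix across the channel dimension at every pixel
Y_mat = torch.einsum('oc,nchw->nohw', W, X)
print(torch.allclose(Y_conv, Y_mat, atol=1e-6))
```

The zero blocks off the diagonal are exactly the cross-group connections that a grouped convolution never computes or stores.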


A common feature of the designs we have discussed so far is that the network design is 
fairly manual, primarily relying on the ingenuity of the designer to find the “right” network 
hyperparameters. While clearly feasible, it is also very costly in terms of human time and 
there is no guarantee that the outcome is optimal in any sense. In Section 8.8 we will discuss 
a number of strategies for obtaining high quality networks in a more automated fashion. In 
particular, we will review the notion of network design spaces that led to the RegNetX/Y 
models (Radosavovic et al., 2020). 


8.6.7 Exercises 


1. What are the major differences between the Inception block in Fig. 8.4.1 and the residual 
block? How do they compare in terms of computation, accuracy, and the classes of 
functions they can describe? 


2. Refer to Table 1 in the ResNet paper (He et al., 2016) to implement different variants of 
the network. 


3. For deeper networks, ResNet introduces a “bottleneck” architecture to reduce model 
complexity. Try to implement it. 


4. In subsequent versions of ResNet, the authors changed the “convolution, batch
normalization, and activation” structure to the “batch normalization, activation, and
convolution” structure. Make this improvement yourself. See Figure 1 in He et al. (2016) for
details.


5. Why can’t we just increase the complexity of functions without bound, even if the func- 
tion classes are nested? 




8.7 Densely Connected Networks (DenseNet) 


ResNet significantly changed the view of how to parametrize the functions in deep net- 
works. DenseNet (dense convolutional network) is to some extent the logical extension of 
this (Huang et al., 2017). DenseNet is characterized by both the connectivity pattern where 
each layer connects to all the preceding layers and the concatenation operation (rather than 
the addition operator in ResNet) to preserve and reuse features from earlier layers. To un- 
derstand how to arrive at it, let’s take a small detour to mathematics. 


import torch
from torch import nn
from d2l import torch as d2l


8.7.1 From ResNet to DenseNet 


Recall the Taylor expansion for functions. At the point x = 0 it can be written as 


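$$f(x) = f(0) + x \cdot \left[f'(0) + x \cdot \left[\frac{f''(0)}{2!} + x \cdot \left[\frac{f'''(0)}{3!} + \cdots \right]\right]\right]. \qquad (8.7.1)$$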


The key point is that it decomposes a function into terms of increasingly higher order. In a 
similar vein, ResNet decomposes functions into 


f(x)=x+g(x). (8.7.2) 


That is, ResNet decomposes f into a simple linear term and a more complex nonlinear one. 
What if we wanted to capture (not necessarily add) information beyond two terms? One 
such solution is DenseNet (Huang et al., 2017). 


The main difference between ResNet (left) and DenseNet (right) in cross-layer
connections: use of addition and use of concatenation.


As shown in Fig. 8.7.1, the key difference between ResNet and DenseNet is that in the
latter case outputs are concatenated (denoted by $[,]$) rather than added. As a result, we
perform a mapping from $\mathbf{x}$ to its values after applying an increasingly complex
sequence of functions:


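x \to \left[x, f_1(x), f_2\left(\left[x, f_1(x)\right]\right), f_3\left(\left[x, f_1(x), f_2\left(\left[x, f_1(x)\right]\right)\right]\right), \ldots\right]. (8.7.3)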


In the end, all these functions are combined in an MLP to reduce the number of features again.
In terms of implementation this is quite simple: rather than adding terms, we concatenate 
them. The name DenseNet arises from the fact that the dependency graph between variables 
becomes quite dense. The final layer of such a chain is densely connected to all previous 
layers. The dense connections are shown in Fig. 8.7.2. 
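

To make the contrast between addition and concatenation concrete, the following is a minimal sketch in plain Python that treats a feature vector as a list of numbers. The helper names (residual_step, dense_step) and the toy "layers" (double, total) are hypothetical and exist only for illustration; real networks operate on tensors and concatenate along the channel dimension.

```python
def residual_step(x, g):
    """ResNet-style update: add g(x) to x elementwise, so the size is unchanged."""
    return [xi + gi for xi, gi in zip(x, g(x))]


def dense_step(x, f):
    """DenseNet-style update: append f(x) to x, so the size grows."""
    return x + f(x)  # list concatenation, not numeric addition


def double(v):
    # Toy "layer" that keeps the input size (required for addition).
    return [2 * vi for vi in v]


def total(v):
    # Toy "layer" that emits one new feature from all features seen so far.
    return [sum(v)]


x = [1.0, 2.0]
res = residual_step(x, double)                    # same length as x
dense = dense_step(dense_step(x, total), total)   # one extra feature per step
print(len(res), len(dense))
```

Note that each dense step sees everything produced before it, which is why the number of features keeps growing with depth.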


Fig. 8.7.2: Dense connections in DenseNet. Note how the dimensionality increases with depth.


The main components that comprise a DenseNet are dense blocks and transition layers. The
former define how the inputs and outputs are concatenated, while the latter control the
number of channels so that it is not too large, since the expansion
x \to [x, f_1(x), f_2([x, f_1(x)]), \ldots] can be quite high-dimensional.


8.7.2 Dense Blocks 


DenseNet uses the modified “batch normalization, activation, and convolution” structure 
of ResNet (see the exercise in Section 8.6). First, we implement this convolution block 
structure. 


def conv_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=3, padding=1))


A dense block consists of multiple convolution blocks, each using the same number of 
output channels. In the forward propagation, however, we concatenate the input and output 
of each convolution block on the channel dimension. Lazy evaluation allows us to adjust 
the dimensionality automatically. 


class DenseBlock(nn.Module):
    def __init__(self, num_convs, num_channels):
        super(DenseBlock, self).__init__()
        layer = []
        for i in range(num_convs):
            layer.append(conv_block(num_channels))
        self.net = nn.Sequential(*layer)

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate input and output of each block along the channels
            X = torch.cat((X, Y), dim=1)
        return X


In the following example, we define a DenseBlock instance with two convolution blocks of 
10 output channels. When using an input with three channels, we will get an output with 
3 + 10 + 10 = 23 channels. The number of convolution block channels controls the growth
in the number of output channels relative to the number of input channels. This is also 
referred to as the growth rate. 


blk = DenseBlock(2, 10)
X = torch.randn(4, 3, 8, 8)
Y = blk(X)
Y.shape


torch.Size([4, 23, 8, 8])
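

The channel count above can also be checked without running a network. The sketch below simply replays the concatenation arithmetic; the function name dense_block_out_channels is an assumption introduced for this illustration and is not part of the d2l library.

```python
def dense_block_out_channels(in_channels, num_convs, growth_rate):
    """Channels after a dense block: every convolution block appends
    growth_rate new channels to whatever has been concatenated so far."""
    channels = in_channels
    for _ in range(num_convs):
        channels += growth_rate
    return channels


# Same setting as the example above: 3 input channels, 2 blocks, growth rate 10.
print(dense_block_out_channels(3, 2, 10))
```

The result grows linearly in the number of convolution blocks, which is the reason transition layers are needed to rein in the channel count.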


8.7.3 Transition Layers 


Since each dense block will increase the number of channels, adding too many of them will 
lead to an excessively complex model. A transition layer is used to control the complexity 
of the model. It reduces the number of channels by using a 1 × 1 convolution. Moreover, it
halves the height and width via average pooling with a stride of 2. 


def transition_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))


Apply a transition layer with 10 channels to the output of the dense block in the previous 
example. This reduces the number of output channels to 10, and halves the height and 
width. 


blk = transition_block(10)
blk(Y).shape


torch.Size([4, 10, 4, 4]) 


8.7.4 DenseNet Model 


Next, we will construct a DenseNet model. DenseNet first uses the same single convolu- 
tional layer and max-pooling layer as in ResNet. 


class DenseNet(d2l.Classifier):
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))


Then, similar to the four modules made up of residual blocks that ResNet uses, DenseNet 
uses four dense blocks. As with ResNet, we can set the number of convolutional layers used 
in each dense block. Here, we set it to 4, consistent with the ResNet-18 model in Section 
8.6. Furthermore, we set the number of channels (i.e., growth rate) for the convolutional 
layers in the dense block to 32, so 128 channels will be added to each dense block. 


In ResNet, the height and width are reduced between each module by a residual block with 
a stride of 2. Here, we use the transition layer to halve the height and width and halve the 
number of channels. Similar to ResNet, a global pooling layer and a fully connected layer 
are connected at the end to produce the output. 


@d2l.add_to_class(DenseNet)
def __init__(self, num_channels=64, growth_rate=32, arch=(4, 4, 4, 4),
             lr=0.1, num_classes=10):
    super(DenseNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1())
    for i, num_convs in enumerate(arch):
        self.net.add_module(f'dense_blk{i+1}', DenseBlock(num_convs,
                                                          growth_rate))
        # The number of output channels in the previous dense block
        num_channels += num_convs * growth_rate
        # A transition layer that halves the number of channels is added
        # between the dense blocks
        if i != len(arch) - 1:
            num_channels //= 2
            self.net.add_module(f'tran_blk{i+1}', transition_block(
                num_channels))
    self.net.add_module('last', nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.LazyLinear(num_classes)))
    self.net.apply(d2l.init_cnn)
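

To see how the channel count and spatial size evolve through the model defined above, the following sketch replays the bookkeeping in plain Python. It is only an illustration: the function name densenet_trace is hypothetical, and the size arithmetic assumes the standard output-size rule floor((n + 2p - k) / s) + 1 for convolution and pooling layers with kernel size k, stride s, and padding p.

```python
def densenet_trace(num_channels=64, growth_rate=32, arch=(4, 4, 4, 4), size=96):
    """Return (module name, channels, height/width) after each module."""
    # Stem: 7x7 convolution (stride 2, padding 3), then 3x3 max-pooling
    # (stride 2, padding 1).
    size = (size + 2 * 3 - 7) // 2 + 1
    size = (size + 2 * 1 - 3) // 2 + 1
    trace = [("b1", num_channels, size)]
    for i, num_convs in enumerate(arch):
        # Each convolution block in a dense block adds growth_rate channels.
        num_channels += num_convs * growth_rate
        trace.append((f"dense_blk{i+1}", num_channels, size))
        if i != len(arch) - 1:
            # Transition layer: halve the channels, then 2x2 average pooling
            # with stride 2 halves the height and width.
            num_channels //= 2
            size = (size - 2) // 2 + 1
            trace.append((f"tran_blk{i+1}", num_channels, size))
    return trace


for name, channels, hw in densenet_trace():
    print(f"{name}: {channels} channels, {hw} x {hw}")
```

With the default arguments and a 96 × 96 input, the trace shows each dense block adding 128 channels and each transition layer halving both the channel count and the resolution.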


8.7.5 Training 


Since we are using a deeper network here, we reduce the input height and width from
224 to 96 in this section to simplify the computation.


model = DenseNet(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
trainer.fit(model, data)


[Figure: training curves for DenseNet on Fashion-MNIST; legend: train_loss, val_loss.]


8.7.6 Summary and Discussion


The main components that comprise DenseNet are dense blocks and transition layers. For 
the latter, we need to keep the dimensionality under control when composing the net- 
work by adding transition layers that shrink the number of channels again. In terms of 
cross-layer connections, in contrast to ResNet, where inputs and outputs are added to- 
gether, DenseNet concatenates inputs and outputs on the channel dimension. Although 
these concatenation operations reuse features to achieve computational efficiency, unfortu- 
nately they lead to heavy GPU memory consumption. As a result, applying DenseNet may 
require more memory-efficient implementations that may increase training time (Pleiss et 
al., 2017). 


8.7.7 Exercises 
1. Why do we use average pooling rather than max-pooling in the transition layer? 


2. One of the advantages mentioned in the DenseNet paper is that its model parameters are 
smaller than those of ResNet. Why is this the case? 


3. One problem for which DenseNet has been criticized is its high memory consumption. 


1. Is this really the case? Try to change the input shape to 224 × 224 to compare the
actual GPU memory consumption empirically. 


2. Can you think of an alternative means of reducing the memory consumption? How 
would you need to change the framework? 


4. Implement the various DenseNet versions presented in Table 1 of the DenseNet paper 
(Huang et al., 2017). 


5. Design an MLP-based model by applying the DenseNet idea. Apply it to the housing 
price prediction task in Section 5.7. 


8.8 Designing Convolution Network Architectures


The previous sections have taken us on a tour of modern network design for computer 
vision. Common to all the work we covered was that it greatly relied on the intuition of 
scientists. Many of the architectures are heavily informed by human creativity and to a 
much lesser extent by systematic exploration of the design space that deep networks offer. 
Nonetheless, this network engineering approach has been tremendously successful. 


Ever since AlexNet (Section 8.1) beat conventional computer vision models on ImageNet, 
it has become popular to construct very deep networks by stacking blocks of convolutions, 
all designed according to the same pattern. In particular, 3 × 3 convolutions were popularized
by VGG networks (Section 8.2). NiN (Section 8.3) showed that even 1 × 1 convolutions
could be beneficial by adding local nonlinearities. Moreover, NiN solved the problem
of aggregating information at the head of a network by aggregating across all locations. 
GoogLeNet (Section 8.4) added multiple branches of different convolution width, combin- 
ing the advantages of VGG and NiN in its Inception block. ResNets (Section 8.6) changed 
the inductive bias towards the identity mapping (from f(x) = 0). This allowed for very 
deep networks. Almost a decade later, the ResNet design is still popular, a testament to 
its design. Lastly, ResNeXt (Section 8.6.5) added grouped convolutions, offering a better 
trade-off between parameters and computation. A precursor to Transformers for vision, 
the Squeeze-and-Excitation Networks (SENets) allow for efficient information transfer be- 
tween locations (Hu et al., 2018). This was accomplished by computing a per-channel 
global attention function. 


Up to now we have omitted networks obtained via neural architecture search (NAS) (Liu 
et al., 2018, Zoph and Le, 2016). We chose to do so since their cost is usually enormous, 
relying on brute-force search, genetic algorithms, reinforcement learning, or some other 
form of hyperparameter optimization. Given a fixed search space, NAS uses a search strat- 
egy to automatically select an architecture based on the returned performance estimation. 
The outcome of NAS is a single network instance. EfficientNets are a notable outcome of 
this search (Tan and Le, 2019). 


In the following we discuss an idea that is quite different to the quest for the single best 
network. It is computationally relatively inexpensive, it leads to scientific insights on the 
way, and it is quite effective in terms of the quality of outcomes. Let’s review the strategy 
by Radosavovic et al. (2020) to design network design spaces. The strategy combines the 
strength of manual design and NAS. It accomplishes this by operating on distributions of 
networks and optimizing the distributions in a way to obtain good performance for entire 
families of networks. The outcome is the RegNet family, specifically RegNetX and RegNetY,
plus a range of guiding principles for the design of performant CNNs.


import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


8.8.1 The AnyNet Design Space 


The description below closely follows the reasoning in Radosavovic et al. (2020) with some 
abbreviations to make it fit in the scope of the book. To begin, we need a template for the 
family of networks to explore. One of the commonalities of the designs in this chapter is 
that the networks consist of a stem, a body and a head. The stem performs initial image 
processing, often through convolutions with a larger window size. The body consists of 
multiple blocks, carrying out the bulk of the transformations needed to go from raw images 
to object representations. Lastly, the head converts this into the desired outputs, such as 
via a softmax regressor for multiclass classification. The body, in turn, consists of multiple 
stages, operating on the image at decreasing resolutions. In fact, both the stem and each 
subsequent stage quarter the spatial resolution. Lastly, each stage consists of one or more 
blocks. This pattern is common to all networks, from VGG to ResNeXt. Indeed, for the 
design of generic AnyNet networks, Radosavovic et al. (2020) used the ResNeXt block of 
Fig. 8.6.5. 


Fig. 8.8.1: The AnyNet design space. The numbers (c, r) along each arrow indicate the number of
channels c and the resolution r × r of the images at that point. From left to right: generic
network structure composed of stem, body, and head; body composed of four stages;
detailed structure of a stage; two alternative structures for blocks, one without
downsampling and one that halves the resolution in each dimension. Design choices
include depth d_i, the number of output channels c_i, the number of groups g_i, and
bottleneck ratio k_i for any stage i.


Let’s review the structure outlined in Fig. 8.8.1 in detail. As mentioned, an AnyNet consists
of a stem, body, and head. The stem takes as its input RGB images (3 channels), using a
3 × 3 convolution with a stride of 2, followed by a batch norm, to halve the resolution from
r × r to r/2 × r/2. Moreover, it generates c_0 channels that serve as input to the body.
Since the network is designed to work well with ImageNet images of shape 224 × 224 × 3,
the body serves to reduce this to 7 × 7 × c_4 through 4 stages (recall that 224/2^{1+4} = 7),
each with an eventual stride of 2. Lastly, the head employs an entirely standard design via
global average pooling, similar to NiN (Section 8.3), followed by a fully connected layer to
emit an n-dimensional vector for n-class classification.
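

The resolution arithmetic in this paragraph can be spelled out in a few lines. The sketch below is a simplification that assumes every halving divides the size exactly, which holds for 224; the function name anynet_resolutions is introduced here for illustration only.

```python
def anynet_resolutions(input_size=224, num_stages=4):
    """Height/width after the stem and after each of the stages,
    assuming the stem and every stage halve the resolution."""
    sizes = [input_size // 2]          # the stem uses a stride of 2
    for _ in range(num_stages):
        sizes.append(sizes[-1] // 2)   # each stage has an eventual stride of 2
    return sizes


print(anynet_resolutions())
```

The last entry equals 224 / 2^(1+4): one halving for the stem plus one per stage.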


Most of the relevant design decisions are inherent to the body of the network. It proceeds in 
stages, where each stage is composed of the same type of ResNeXt blocks as we discussed 
in Section 8.6.5. The design there is again entirely generic: we begin with a block that 
halves the resolution by using a stride of 2 (the rightmost in Fig. 8.8.1). To match this, the 
residual branch of the ResNeXt block needs to pass through a 1 x 1 convolution. This block 
is followed by a variable number of additional ResNeXt blocks that leave both resolution 
and the number of channels unchanged. Note that a common design practice is to add 
a slight bottleneck in the design of convolutional blocks. As such, with bottleneck ratio
k_i ≥ 1 we afford some number of channels, c_i/k_i, within each block for stage i (as the
experiments show, this is not really effective and should be skipped). Lastly, since we are
dealing with ResNeXt blocks, we also need to pick the number of groups g_i for grouped
convolutions at stage i.


This seemingly generic design space provides us nonetheless with many parameters: we
can set the block width (number of channels) c_0, ..., c_4, the depth (number of blocks) per
stage d_1, ..., d_4, the bottleneck ratios k_1, ..., k_4, and the group widths (numbers of groups)
g_1, ..., g_4. In total this adds up to 17 parameters, resulting in an unreasonably large number
of configurations that would warrant exploring. We need some tools to reduce this huge
design space effectively. This is where the conceptual beauty of design spaces comes in.
Before we do so, let’s implement the generic design first.


class AnyNet(d2l.Classifier):
    def stem(self, num_channels):
        return nn.Sequential(
            nn.LazyConv2d(num_channels, kernel_size=3, stride=2, padding=1),
            nn.LazyBatchNorm2d(), nn.ReLU())


Each stage consists of depth ResNeXt blocks, where num_channels specifies the block 
width. Note that the first block halves the height and width of input images. 


@d2l.add_to_class(AnyNet)
def stage(self, depth, num_channels, groups, bot_mul):
    blk = []
    for i in range(depth):
        if i == 0:
            blk.append(d2l.ResNeXtBlock(num_channels, groups, bot_mul,
                use_1x1conv=True, strides=2))
        else:
            blk.append(d2l.ResNeXtBlock(num_channels, groups, bot_mul))
    return nn.Sequential(*blk)


Putting the network stem, body, and head together, we complete the implementation of
AnyNet.


@d2l.add_to_class(AnyNet)
def __init__(self, arch, stem_channels, lr=0.1, num_classes=10):
    super(AnyNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.stem(stem_channels))
    for i, s in enumerate(arch):
        self.net.add_module(f'stage{i+1}', self.stage(*s))
    self.net.add_module('head', nn.Sequential(
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.LazyLinear(num_classes)))
    self.net.apply(d2l.init_cnn)
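

To get a feeling for the size of this design space, the sketch below draws one random configuration in the (depth, num_channels, groups, bot_mul) format expected by the stage method and counts its free parameters. The candidate values and the helper names (sample_anynet_config, num_free_parameters) are assumptions made for this illustration; they are not the values explored by Radosavovic et al. (2020).

```python
import random

# Hypothetical candidate values, chosen so that num_channels * bot_mul is
# divisible by every entry in GROUPS.
DEPTHS = [1, 2, 4, 8]
WIDTHS = [64, 128, 256]
GROUPS = [1, 2, 4, 8, 16]
BOT_MULS = [1.0, 0.5, 0.25]
STEM_CHANNELS = [16, 32, 64]


def sample_anynet_config(rng, num_stages=4):
    """Draw one configuration uniformly at random from the candidate values."""
    arch = tuple(
        (rng.choice(DEPTHS), rng.choice(WIDTHS),
         rng.choice(GROUPS), rng.choice(BOT_MULS))
        for _ in range(num_stages))
    return {"stem_channels": rng.choice(STEM_CHANNELS), "arch": arch}


def num_free_parameters(config):
    """One stem width plus four choices for every stage."""
    return 1 + sum(len(stage) for stage in config["arch"])


rng = random.Random(0)  # fixed seed so the draw is reproducible
config = sample_anynet_config(rng)
print(config)
print(num_free_parameters(config))
```

Under these assumptions, a sampled configuration could be passed on as AnyNet(config["arch"], config["stem_channels"]), matching the constructor defined above.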


8.8.2 Distributions and Parameters of Design Spaces 


As just discussed in Section 8.8.1, parameters of a design space are hyperparameters of 
networks in that design space. Consider the problem of identifying good parameters in the 
AnyNet design space. We could try finding the single best parameter choice for a given 
amount of computation (e.g., FLOPs and compute time). If we allowed for even only two 
possible choices for each parameter, we would have to explore 2^{17} = 131072 combinations
to find the best solution. This is clearly infeasible because of its exorbitant cost. Even 
worse, we do not really learn anything from this exercise in terms of how one should design 
a network. Next time we add, say, an X-stage, or a shift operation, or similar, we would need 
to start from scratch. Even worse, due to the stochasticity in training (rounding, shuffling, 
bit errors), no two runs are likely to produce exactly the same results. A better strategy 
would be to try to determine general guidelines of how the choices of parameters should 
be related. For instance, the bottleneck ratio, the number of channels, blocks, groups, or 
their change between layers should ideally be governed by a collection of simple rules. The 
approach in Radosavovic et al. (2019) relies on the following four assumptions: 


1. We assume that general design principles actually exist, so that many networks satis- 
fying these requirements should offer good performance. Consequently, identifying a 
distribution over networks can be a sensible strategy. In other words, we assume that 
there are many good needles in the haystack. 


2. We need not train networks to convergence before we can assess whether a network is 
good. Instead, it is sufficient to use the intermediate results as reliable guidance for 
final accuracy. Using (approximate) proxies to optimize an objective is referred to as 
multi-fidelity optimization (Forrester et al., 2007). Consequently, design optimization is 
carried out, based on the accuracy achieved after only a few passes through the dataset, 
reducing the cost significantly. 


3. Results obtained at a smaller scale (for smaller networks) generalize to larger ones. Con- 
sequently, optimization is carried out for networks that are structurally similar, but with 
a smaller number of blocks, fewer channels, etc. Only in the end will we need to verify 
that the so-found networks also offer good performance at scale. 


4. Aspects of the design can be approximately factorized so that it is possible to infer
their effect on the quality of the outcome somewhat independently. In other words, the
optimization problem is moderately easy.


These assumptions allow us to test many networks cheaply. In particular, we can sample 
uniformly from the space of configurations and evaluate their performance. Subsequently, 
we can evaluate the quality of the choice of parameters by reviewing the distribution of 
error/accuracy that can be achieved with said networks. Denote by F(e) the cumulative 
distribution function (CDF) for errors committed by networks of a given design space, 
drawn using probability distribution p. That is,


F(e, p) \stackrel{\mathrm{def}}{=} P_{\mathrm{net} \sim p} \{e(\mathrm{net}) \leq e\}. (8.8.1)


Our goal is now to find a distribution p over networks such that most networks have a very
low error rate and where the support of p is concise. Of course, this is computationally
infeasible to perform accurately. We resort to a sample of networks
Z \stackrel{\mathrm{def}}{=} \{\mathrm{net}_1, \ldots, \mathrm{net}_n\} (with errors e_1, \ldots, e_n, respectively)
from p and use the empirical CDF \hat{F}(e, Z) instead:


\hat{F}(e, Z) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(e_i \leq e). (8.8.2)
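

The empirical CDF is straightforward to compute from a list of sampled errors. The following is a minimal sketch; the function name empirical_cdf and the error values are made up for illustration and do not come from any experiment.

```python
def empirical_cdf(errors, e):
    """Fraction of sampled networks whose error is at most e."""
    return sum(1 for err in errors if err <= e) / len(errors)


# Hypothetical validation errors of four sampled networks.
errors = [0.42, 0.35, 0.51, 0.38]
print(empirical_cdf(errors, 0.40))
```

Here two of the four networks have an error of at most 0.40, so the empirical CDF at that threshold is one half.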


Whenever the CDF for one set of choices majorizes (or matches) another CDF it follows
that its choice of parameters is superior (or indifferent). Accordingly, Radosavovic et al.
(2020) experimented with a shared network bottleneck ratio k_i = k for all stages i of the
network. This gets rid of three of the four parameters governing the bottleneck ratio. To
assess whether this (negatively) affects the performance one can draw networks from the
constrained and from the unconstrained distribution and compare the corresponding CDFs.
It turns out that this constraint does not affect the accuracy of the distribution of networks
at all, as can be seen in the first panel of Fig. 8.8.2. Likewise, we could choose to pick
the same group width g_i = g occurring at the various stages of the network. Again, this
does not affect performance, as can be seen in the second panel of Fig. 8.8.2. Both steps
combined reduce the number of free parameters by six.

Fig. 8.8.2: Comparing error empirical distribution functions of design spaces. AnyNet_A is the
original design space; AnyNet_B ties the bottleneck ratios, AnyNet_C also ties group
widths, AnyNet_D increases the network depth across stages. From left to right: (i) tying
bottleneck ratios has no effect on performance; (ii) tying group widths has no effect on
performance; (iii) increasing network widths (channels) across stages improves
performance; (iv) increasing network depths across stages improves performance. Figure
courtesy of Radosavovic et al. (2020).
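

Comparing two design spaces by their empirical CDFs can be sketched as follows. The helper majorizes checks whether one CDF lies on or above the other at every observed error level; since empirical CDFs only change at observed errors, checking those thresholds suffices. The names and the error values are hypothetical and serve only as an illustration.

```python
def empirical_cdf(errors, e):
    """Fraction of sampled networks whose error is at most e."""
    return sum(1 for err in errors if err <= e) / len(errors)


def majorizes(errors_a, errors_b):
    """True if the empirical CDF of A is at least that of B at every
    error level observed in either sample."""
    thresholds = sorted(set(errors_a) | set(errors_b))
    return all(empirical_cdf(errors_a, e) >= empirical_cdf(errors_b, e)
               for e in thresholds)


# Made-up errors for networks drawn from two design spaces.
unconstrained = [0.50, 0.44, 0.41, 0.47]
constrained = [0.49, 0.43, 0.41, 0.46]
print(majorizes(constrained, unconstrained))
print(majorizes(unconstrained, constrained))
```

In this toy sample the constrained space reaches every error level with at least as many networks as the unconstrained one, so constraining it would cost nothing.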


Next we look for ways to reduce the multitude of potential choices for width and depth of the
stages. It is a reasonable assumption that, as we go deeper, the number of channels should
increase, i.e., c_i ≥ c_{i-1} (w_{i+1} ≥ w_i per their notation in Fig. 8.8.2), yielding AnyNetX_D.
Likewise, it is equally reasonable to assume that as the stages progress, they should become
deeper, i.e., d_i ≥ d_{i-1}, yielding AnyNetX_E. This can be experimentally verified in the third
and fourth panel of Fig. 8.8.2, respectively.


8.8.3 RegNet 


The resulting AnyNetX_E design space consists of simple networks following easy-to-interpret
design principles:


• Share the bottleneck ratio k_i = k for all stages i;
• Share the group width g_i = g for all stages i;
• Increase network width across stages: c_i ≤ c_{i+1};
• Increase network depth across stages: d_i ≤ d_{i+1}.
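

The four principles above translate directly into a check on an architecture given in the (depth, num_channels, groups, bot_mul) format used by AnyNet. The following is a sketch; the function names are assumptions introduced for this illustration.

```python
def is_nondecreasing(values):
    """True if every entry is at least as large as its predecessor."""
    return all(a <= b for a, b in zip(values, values[1:]))


def satisfies_regnet_principles(arch):
    """arch: one (depth, num_channels, groups, bot_mul) tuple per stage."""
    depths = [stage[0] for stage in arch]
    widths = [stage[1] for stage in arch]
    shared_groups = len({stage[2] for stage in arch}) == 1
    shared_bottleneck = len({stage[3] for stage in arch}) == 1
    return (shared_bottleneck and shared_groups
            and is_nondecreasing(widths) and is_nondecreasing(depths))


good = ((4, 32, 16, 1), (6, 80, 16, 1))   # width and depth grow across stages
bad = ((6, 80, 16, 1), (4, 32, 8, 1))     # shrinking stages, untied groups
print(satisfies_regnet_principles(good), satisfies_regnet_principles(bad))
```

A filter of this kind is one simple way to restrict random sampling to the constrained design space.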


This leaves us with a final set of choices: how to pick the specific values for the above
parameters of the eventual AnyNetX_E design space. By studying the best-performing
networks from the distribution in AnyNetX_E one can observe the following: the width
of the network ideally increases linearly with the block index across the network, i.e.,
c_j = c_0 + c_a j, where j is the block index and slope c_a > 0. Given that we get to choose a
different block width only per stage, we arrive at a piecewise constant function, engineered
to match this dependence. Furthermore, experiments also show that a bottleneck ratio of
k = 1 performs best, i.e., we are advised not to use bottlenecks at all.
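

One simple way to turn the linear rule into a piecewise constant function is to average the linear targets of all blocks in a stage and round the result to a multiple of some base width. This is an assumption made purely for illustration: the original paper uses its own quantization scheme, which this sketch does not reproduce. The function name stage_widths and the values of c_0 and c_a are likewise hypothetical.

```python
def stage_widths(depths, c0, ca, multiple=8):
    """Per-stage widths approximating the linear rule c_j = c0 + ca * j.

    depths: number of blocks per stage; j runs over all blocks in order.
    Each stage uses the mean of its blocks' linear targets, rounded to a
    multiple of `multiple` (a simplifying assumption of this sketch)."""
    widths, j = [], 0
    for d in depths:
        targets = [c0 + ca * (j + t) for t in range(d)]
        mean = sum(targets) / d
        widths.append(max(multiple, int(round(mean / multiple)) * multiple))
        j += d
    return widths


print(stage_widths((4, 6), c0=20, ca=8))
```

Because the slope is positive, the resulting widths never decrease from one stage to the next, in line with the design principles.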


We recommend that the interested reader review further details on the design of specific
networks for different amounts of computation in Radosavovic et al. (2020). For
instance, an effective 32-layer RegNetX variant is given by k = 1 (no bottleneck), g = 16
(group width is 16), c_1 = 32 and c_2 = 80 channels for the first and second stage, respectively,
chosen to be d_1 = 4 and d_2 = 6 blocks deep. The astonishing insight from the
design is that it still applies, even when investigating networks at a larger scale. Even better,
it also holds for Squeeze-and-Excitation (SE) network designs (RegNetY) that have a
global channel activation (Hu et al., 2018).


class RegNetX32(AnyNet):
    def __init__(self, lr=0.1, num_classes=10):
        stem_channels, groups, bot_mul = 32, 16, 1
        depths, channels = (4, 6), (32, 80)
        super().__init__(
            ((depths[0], channels[0], groups, bot_mul),
             (depths[1], channels[1], groups, bot_mul)),
            stem_channels, lr, num_classes)


We can see that each RegNetX stage progressively reduces resolution and increases output
channels.


RegNetX32().layer_summary((1, 1, 96, 96))


Sequential output shape: torch.Size([1, 32, 48, 48])
Sequential output shape: torch.Size([1, 32, 24, 24])
Sequential output shape: torch.Size([1, 80, 12, 12])
Sequential output shape: torch.Size([1, 10])
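

The shapes in this summary can be reproduced by hand. The sketch below replays the arithmetic in plain Python, assuming that the stem and the first block of every stage use a 3 × 3 convolution with stride 2 and padding 1, and applying the usual output-size rule floor((n + 2p - k) / s) + 1. The function name anynet_shape_trace is hypothetical.

```python
def anynet_shape_trace(arch, stem_channels, num_classes, size):
    """Output shapes (without the batch dimension) after the stem,
    after each stage, and after the head."""
    def halve(n):
        # 3x3 convolution with stride 2 and padding 1
        return (n + 2 * 1 - 3) // 2 + 1

    size = halve(size)
    trace = [(stem_channels, size, size)]
    for depth, num_channels, groups, bot_mul in arch:
        size = halve(size)  # only the first block of a stage downsamples
        trace.append((num_channels, size, size))
    trace.append((num_classes,))  # global pooling plus linear layer
    return trace


regnetx32_arch = ((4, 32, 16, 1), (6, 80, 16, 1))
for shape in anynet_shape_trace(regnetx32_arch, 32, 10, 96):
    print(shape)
```

The depth of a stage does not appear in the shapes, since all blocks after the first leave resolution and channel count unchanged.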


8.8.4 Training 
Training the 32-layer RegNetX on the Fashion-MNIST dataset is just like before. 


model = RegNetX32(lr=0.05)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
trainer.fit(model, data)


[Figure: training curves for RegNetX32 on Fashion-MNIST; legend: train_loss, val_loss, val_acc.]


8.8.5 Discussion


With desirable inductive biases (assumptions or preferences) like locality and translation 
invariance (Section 7.1) for vision, CNNs have been the dominant architectures in this area. 
This remained the case from LeNet up until Transformers (Section 11.7) (Dosovitskiy et 
al., 2021, Touvron et al., 2021) started surpassing CNNs in terms of accuracy. While much 
of the recent progress in terms of vision Transformers can be backported into CNNs (Liu 
et al., 2022), it is only possible at a higher computational cost. Just as importantly, recent 
hardware optimizations (NVIDIA Ampere and Hopper) have only widened the gap in favor 
of Transformers. 


It is worth noting that Transformers have a significantly lower degree of inductive bias to- 
wards locality and translation invariance than CNNs. That learned structures prevailed is 
due, not least, to the availability of large image collections, such as LAION-400m and 
LAION-5B (Schuhmann et al., 2022) with up to 5 billion images. Quite surprisingly,
some of the more relevant work in this context even includes MLPs (Tolstikhin et al., 
2021). 


In sum, vision Transformers (Section 11.8) by now lead in terms of state-of-the-art perfor- 
mance in large-scale image classification, showing that scalability trumps inductive biases 
(Dosovitskiy et al., 2021). This includes pretraining large-scale Transformers (Section
11.9) with multi-head self-attention (Section 11.5). We invite the readers to dive into these 
chapters for a much more detailed discussion. 


8.8.6 Exercises 


1. Increase the number of stages to four. Can you design a deeper RegNetX that performs 
better? 


2. De-ResNeXt-ify RegNets by replacing the ResNeXt block with the ResNet block. How 
does your new model perform? 


3. Implement multiple instances of a “VioNet” family by violating the design principles of 
RegNetX. How do they perform? Which of (d_i, c_i, g_i, b_i) is the most important factor?


4. Your goal is to design the “perfect” MLP. Can you use the design principles introduced 
above to find good architectures? Is it possible to extrapolate from small to large net- 
works? 


Up until now, we have focused primarily on fixed-length data. When introducing linear and
logistic regression in Chapter 3 and Chapter 4 and multilayer perceptrons in Chapter 5, we
were happy to assume that each feature vector x_i consisted of a fixed number of components
x_1, ..., x_d, where each numerical feature x_j corresponded to a particular attribute. These
datasets are sometimes called tabular, because they can be arranged in tables, where each 
example i gets its own row, and each attribute gets its own column. Crucially, with tabular 
data, we seldom assume any particular structure over the columns. 


Subsequently, in Chapter 7, we moved on to image data, where inputs consist of the raw 
pixel values at each coordinate in an image. Image data hardly fitted the bill of a prototypical
tabular dataset. There, we needed to call upon convolutional neural networks (CNNs) to 
handle the hierarchical structure and invariances. However, our data were still of fixed 
length. Every Fashion-MNIST image is represented as a 28 × 28 grid of pixel values.
Moreover, our goal was to develop a model that looked at just one image and then outputted 
a single prediction. But what should we do when faced with a sequence of images, as in a 
video, or when tasked with producing a sequentially structured prediction, as in the case of 
image captioning? 


A great many learning tasks require dealing with sequential data. Image captioning, speech 
synthesis, and music generation all require that models produce outputs consisting of se- 
quences. In other domains, such as time series prediction, video analysis, and musical 
information retrieval, a model must learn from inputs that are sequences. These demands 
often arise simultaneously: tasks such as translating passages of text from one natural lan- 
guage to another, engaging in dialogue, or controlling a robot, demand that models both 
ingest and output sequentially structured data. 


Recurrent neural networks (RNNs) are deep learning models that capture the dynamics of 
sequences via recurrent connections, which can be thought of as cycles in the network of 
nodes. This might seem counterintuitive at first. After all, it is the feedforward nature of 
neural networks that makes the order of computation unambiguous. However, recurrent 
edges are defined in a precise way that ensures that no such ambiguity can arise. Recurrent 
neural networks are unrolled across time steps (or sequence steps), with the same under- 
lying parameters applied at each step. While the standard connections are applied syn- 
chronously to propagate each layer’s activations to the subsequent layer at the same time 
step, the recurrent connections are dynamic, passing information across adjacent time steps. 
As the unfolded view in Fig. 9.1 reveals, RNNs can be thought of as feedforward neural 
networks where each layer’s parameters (both conventional and recurrent) are shared across
time steps. 


Fig. 9.1: On the left recurrent connections are depicted via cyclic edges. On the right, we unfold
the RNN over time steps. Here, recurrent edges span adjacent time steps, while
conventional connections are computed synchronously.
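

The idea of unrolling with shared parameters can be previewed with a toy example. The sketch below uses a scalar hidden state and the update h_t = tanh(w_x x_t + w_h h_{t-1} + b), a common textbook form that is assumed here for illustration; the models in this chapter are defined precisely in later sections. The function name rnn_unroll and the parameter values are hypothetical.

```python
import math


def rnn_unroll(inputs, w_x, w_h, b, h0=0.0):
    """Apply the same parameters (w_x, w_h, b) at every time step and
    return the hidden state after each step."""
    h, states = h0, []
    for x in inputs:
        # The recurrent edge carries h from the previous time step.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# The same three parameters handle sequences of different lengths.
short = rnn_unroll([1.0, 0.5], w_x=0.8, w_h=0.5, b=0.0)
longer = rnn_unroll([1.0, 0.5, -1.0, 2.0], w_x=0.8, w_h=0.5, b=0.0)
print(len(short), len(longer))
```

Since the loop visits the time steps in order, the computation is unambiguous even though the network diagram contains a cycle.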


Like neural networks more broadly, RNNs have a long discipline-spanning history, origi- 
nating as models of the brain popularized by cognitive scientists and subsequently adopted 
as practical modeling tools employed by the machine learning community. As we do for 
deep learning more broadly, in this book we adopt the machine learning perspective, focusing
on RNNs as practical tools that rose to popularity in the 2010s owing to breakthrough
results on such diverse tasks as handwriting recognition (Graves et al., 2008), machine
translation (Sutskever et al., 2014), and recognizing medical diagnoses (Lipton et al., 2016).
We point the reader interested in more background material to a publicly available compre- 
hensive review (Lipton et al., 2015). We also note that sequentiality is not unique to RNNs. 
For example, the CNNs that we already introduced can be adapted to handle data of varying 
length, e.g., images of varying resolution. Moreover, RNNs have recently ceded consider- 
able market share to Transformer models, which will be covered in Chapter 11. However, 
RNNs rose to prominence as the default models for handling complex sequential structure
in deep learning, and remain staple models for sequential modeling to this day. The stories 
of RNNs and of sequence modeling are inextricably linked, and this is as much a chapter 
about the ABCs of sequence modeling problems as it is a chapter about RNNs. 


One key insight paved the way for a revolution in sequence modeling. While the inputs 
and targets for many fundamental tasks in machine learning cannot easily be represented 
as fixed-length vectors, they can often nevertheless be represented as varying-length se- 
quences of fixed-length vectors. For example, documents can be represented as sequences 
of words; medical records can often be represented as sequences of events (encounters, 
medications, procedures, lab tests, diagnoses); videos can be represented as varying-length 
sequences of still images. 


While sequence models have popped up in numerous application areas, basic research in the 
area has been driven predominantly by advances on core tasks in natural language process- 
ing. Thus, throughout this chapter, we will focus our exposition and examples on text data. 
If you get the hang of these examples, then applying the models to other data modalities 
should be relatively straightforward. In the next few sections, we introduce basic notation 
for sequences and some evaluation measures for assessing the quality of sequentially struc- 
tured model outputs. After that, we discuss basic concepts of a language model and use this 
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discussion to motivate our first RNN models. Finally, we describe the method for calculat- 
ing gradients when backpropagating through RNNs and explore some challenges that are 
often encountered when training such networks, motivating the modern RNN architectures 
that will follow in Chapter 10. 


9.1 Working with Sequences 


Up until now, we have focused on models whose inputs consisted of a single feature vector
$\mathbf{x} \in \mathbb{R}^d$. The main change of perspective when developing models capable of processing
sequences is that we now focus on inputs that consist of an ordered list of feature vectors
$\mathbf{x}_1, \ldots, \mathbf{x}_T$, where each feature vector $\mathbf{x}_t$ is indexed by a time step $t \in \mathbb{Z}^+$ lying in
$\mathbb{R}^d$.


Some datasets consist of a single massive sequence. Consider, for example, the extremely 
long streams of sensor readings that might be available to climate scientists. In such cases, 
we might create training datasets by randomly sampling subsequences of some predeter- 
mined length. More often, our data arrives as a collection of sequences. Consider the 
following examples: (i) a collection of documents, each represented as its own sequence of 
words, and each having its own length $T_i$; (ii) sequence representation of patient stays in the
hospital, where each stay consists of a number of events and the sequence length depends 
roughly on the length of the stay. 
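As a rough illustration of the first case, the sketch below draws fixed-length windows at random offsets from one long sequence. It uses only the Python standard library; the function name `sample_subsequences`, the `seed` argument, and the stand-in `readings` list are assumptions made for this example, not part of the book's code.

```python
import random

def sample_subsequences(sequence, length, num_samples, seed=0):
    """Draw fixed-length windows at random start offsets from one long sequence."""
    if length > len(sequence):
        raise ValueError("window length exceeds sequence length")
    rng = random.Random(seed)
    max_start = len(sequence) - length
    starts = [rng.randint(0, max_start) for _ in range(num_samples)]
    return [sequence[s:s + length] for s in starts]

# Hypothetical stand-in for a long stream of sensor readings
readings = list(range(10_000))
batch = sample_subsequences(readings, length=50, num_samples=8)
```

Each window keeps the original temporal order of its elements, which is what a sequence model relies on.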


Previously, when dealing with individual inputs, we assumed that they were sampled inde- 
pendently from the same underlying distribution P(X). While we still assume that entire 
sequences (e.g., entire documents or patient trajectories) are sampled independently, we 
cannot assume that the data arriving at each time step are independent of each other. For 
example, the words that are likely to appear later in a document depend heavily on words occurring
earlier in the document. The medicine a patient is likely to receive on the 10th day
of a hospital visit depends heavily on what transpired in the previous nine days. 


This should come as no surprise. If we did not believe that the elements in a sequence 
were related, we would not have bothered to model them as a sequence in the first place. 
Consider the usefulness of the auto-fill features that are popular on search tools and modern 
email clients. They are useful precisely because it is often possible to predict (imperfectly, 
but better than random guessing) what the likely continuations of a sequence might be, 
given some initial prefix. For most sequence models, we do not require independence, or 
even stationarity, of our sequences. Instead, we require only that the sequences themselves 
are sampled from some fixed underlying distribution over entire sequences. 


This flexible approach allows for such phenomena as (i) documents looking significantly 
different at the beginning than at the end; or (ii) patient status evolving either towards recovery
or towards death over the course of a hospital stay; or (iii) customer taste evolving in predictable
ways over the course of continued interaction with a recommender system.

We sometimes wish to predict a fixed target $y$ given sequentially structured input (e.g., sentiment
classification based on a movie review). At other times, we wish to predict a sequentially
structured target $(y_1, \ldots, y_T)$ given a fixed input (e.g., image captioning). Still other
times, our goal is to predict sequentially structured targets based on sequentially structured 
inputs (e.g., machine translation or video captioning). Such sequence-to-sequence tasks 
take two forms: (i) aligned: where the input at each time step aligns with a correspond- 
ing target (e.g., part of speech tagging); (ii) unaligned: where the input and target do not 
necessarily exhibit a step-for-step correspondence (e.g., machine translation). 


Before we worry about handling targets of any kind, we can tackle the most straightforward 
problem: unsupervised density modeling (also called sequence modeling). Here, given a 
collection of sequences, our goal is to estimate the probability mass function that tells us 
how likely we are to see any given sequence, i.e., $p(\mathbf{x}_1, \ldots, \mathbf{x}_T)$.


%matplotlib inline
import torch
from torch import nn
from d2l import torch as d2l


9.1.1 Autoregressive Models 


Before introducing specialized neural networks designed to handle sequentially structured 
data, let’s take a look at some actual sequence data and build up some basic intuitions 
and statistical tools. In particular, we will focus on stock price data from the FTSE 100 
index (Fig. 9.1.1). At each time step $t \in \mathbb{Z}^+$, we observe the price, $x_t$, of the index at that
time.


[Fig. 9.1.1: FTSE 100 Index, plotted from 1984 to 2014.]


Now suppose that a trader would like to make short-term trades, strategically getting into 
or out of the index, depending on whether they believe that it will rise or decline in the 
subsequent time step. Absent any other features (news, financial reporting data, etc.), the
only available signal for predicting the subsequent value is the history of prices to date.
The trader is thus interested in knowing the probability distribution


$$P(x_t \mid x_{t-1}, \ldots, x_1) \tag{9.1.1}$$


over prices that the index might take in the subsequent time step. While estimating the entire 
distribution over a continuously valued random variable can be difficult, the trader would 
be happy to focus on a few key statistics of the distribution, particularly the expected value 
and the variance. One simple strategy for estimating the conditional expectation 


$$\mathbb{E}[x_t \mid x_{t-1}, \ldots, x_1], \tag{9.1.2}$$


would be to apply a linear regression model (recall Section 3.1). Such models that regress 
the value of a signal on the previous values of that same signal are naturally called autoregressive
models. There is just one major problem: the number of inputs, $x_{t-1}, \ldots, x_1$,
varies, depending on $t$. In other words, the number of inputs increases with the amount of
data that we encounter. Thus if we want to treat our historical data as a training set, we 
are left with the problem that each example has a different number of features. Much of 
what follows in this chapter will revolve around techniques for overcoming these challenges 
when engaging in such autoregressive modeling problems where the object of interest is 
$P(x_t \mid x_{t-1}, \ldots, x_1)$ or some statistic(s) of this distribution.


A few strategies recur frequently. First of all, we might believe that although long sequences
$x_{t-1}, \ldots, x_1$ are available, it may not be necessary to look back so far in the history when
predicting the near future. In this case we might content ourselves to condition on some
window of length $\tau$ and only use $x_{t-1}, \ldots, x_{t-\tau}$ observations. The immediate benefit is
that now the number of arguments is always the same, at least for $t > \tau$. This allows us to
train any linear model or deep network that requires fixed-length vectors as inputs. Second,
we might develop models that maintain some summary $h_t$ of the past observations (see
Fig. 9.1.2) and at the same time update $h_t$ in addition to the prediction $\hat{x}_t$. This leads to
models that estimate not only $x_t$ with $\hat{x}_t = P(x_t \mid h_t)$ but also updates of the form
$h_t = g(h_{t-1}, x_{t-1})$. Since $h_t$ is never observed, these models are also called latent autoregressive
models.


Fig. 9.1.2: A latent autoregressive model.
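The two strategies can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the book's implementation: `make_windows` and `latent_rollout` are hypothetical helper names, and the exponential moving average used as the summary is just one simple choice of update function.

```python
def make_windows(x, tau):
    """Fixed-window strategy: features are the tau values before t, label is x[t]."""
    features = [x[t - tau:t] for t in range(tau, len(x))]
    labels = [x[t] for t in range(tau, len(x))]
    return features, labels

def latent_rollout(x, g, predict, h0=0.0):
    """Latent strategy: update a summary h from (h, previous x), then predict from h."""
    h, preds = h0, []
    for x_prev in x[:-1]:
        h = g(h, x_prev)          # summary update
        preds.append(predict(h))  # prediction for the next observation
    return preds

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features, labels = make_windows(x, tau=4)

# Hypothetical summary: an exponential moving average of past observations
ema_preds = latent_rollout(x, g=lambda h, x_prev: 0.5 * h + 0.5 * x_prev,
                           predict=lambda h: h)
```

With the window approach every example has exactly `tau` features, while the latent approach carries a fixed-size state regardless of how long the history is.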


To construct training data from historical data, one typically creates examples by sampling 
windows randomly. In general, we do not expect time to stand still. However, we often 
assume that while the specific values of $x_t$ might change, the dynamics according to which
each subsequent observation is generated given the previous observations do not. Statisticians
call dynamics that do not change stationary.


9.1.2 Sequence Models


Sometimes, especially when working with language, we wish to estimate the joint probabil- 
ity of an entire sequence. This is a common task when working with sequences composed 
of discrete tokens, such as words. Generally, these estimated functions are called sequence 
models and for natural language data, they are called language models. The field of se- 
quence modeling has been driven so much by natural language processing, that we often 
describe sequence models as “language models”, even when dealing with non-language 
data. Language models prove useful for all sorts of reasons. Sometimes we want to evalu- 
ate the likelihood of sentences. For example, we might wish to compare the naturalness of 
two candidate outputs generated by a machine translation system or by a speech recognition 
system. But language modeling gives us not only the capacity to evaluate likelihood, but 
the ability to sample sequences, and even to optimize for the most likely sequences. 


While language modeling might not, at first glance, look like an autoregressive problem, 
we can reduce language modeling to autoregressive prediction by decomposing the joint 
density of a sequence $p(x_1, \ldots, x_T)$ into the product of conditional densities in a
left-to-right fashion by applying the chain rule of probability:


$$P(x_1, \ldots, x_T) = P(x_1) \prod_{t=2}^{T} P(x_t \mid x_{t-1}, \ldots, x_1). \tag{9.1.3}$$


Note that if we are working with discrete signals such as words, then the autoregressive 
model must be a probabilistic classifier, outputting a full probability distribution over the 
vocabulary for whatever word will come next, given the leftwards context. 
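To make the chain-rule decomposition concrete, the following sketch scores a sequence by summing conditional log-probabilities from left to right. The function `sequence_log_prob` and the toy `uniform` conditional are assumptions introduced for illustration; any model that returns a probability for the next token given its left context could be plugged in.

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Sum log P(x_t | x_1, ..., x_{t-1}) over t, following the chain rule."""
    total = 0.0
    for t, token in enumerate(tokens):
        context = tokens[:t]  # empty for the first token, so this term is log P(x_1)
        total += math.log(cond_prob(token, context))
    return total

# Hypothetical toy model: uniform over a 4-symbol vocabulary, ignoring the context
def uniform(token, context):
    return 0.25

log_p = sequence_log_prob(list("abca"), uniform)
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause on long sequences.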


Markov Models 


Now suppose that we wish to employ the strategy mentioned above, where we condition
only on the $\tau$ previous time steps, i.e., $x_{t-1}, \ldots, x_{t-\tau}$, rather than the entire sequence history
$x_{t-1}, \ldots, x_1$. Whenever we can throw away the history beyond the previous $\tau$ steps without
any loss in predictive power, we say that the sequence satisfies a Markov condition, i.e., that
the future is conditionally independent of the past, given the recent history. When $\tau = 1$,
we say that the data is characterized by a first-order Markov model, and when $\tau = k$, we
say that the data is characterized by a $k$th-order Markov model. When the first-order
Markov condition holds ($\tau = 1$), the factorization of our joint probability becomes a product
of probabilities of each word given the previous word:


$$P(x_1, \ldots, x_T) = P(x_1) \prod_{t=2}^{T} P(x_t \mid x_{t-1}). \tag{9.1.4}$$


We often find it useful to work with models that proceed as though a Markov condition were 
satisfied, even when we know that this is only approximately true. With real text documents 
we continue to gain information as we include more and more leftwards context. But these 
gains diminish rapidly. Thus, sometimes we compromise, obviating computational and 
statistical difficulties by training models whose validity depends on a $k$th-order Markov
condition. Even today’s massive RNN- and Transformer-based language models seldom 
incorporate more than thousands of words of context.

With discrete data, a true Markov model simply counts the number of times that each word
has occurred in each context, producing the relative frequency estimate of $P(x_t \mid x_{t-1})$.
Whenever the data assumes only discrete values (as in language), the most likely sequence 
of words can be computed efficiently using dynamic programming. 
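The counting estimate for a first-order Markov model can be sketched as follows. The helper name `fit_bigram` and the toy sentence are assumptions for this example; the sketch applies no smoothing, so unseen pairs simply have no entry.

```python
from collections import Counter

def fit_bigram(tokens):
    """Relative-frequency estimate of P(current | previous) from one token sequence."""
    pair_counts = Counter(zip(tokens[:-1], tokens[1:]))
    context_counts = Counter(tokens[:-1])  # how often each token appears as a context
    return {(prev, cur): n / context_counts[prev]
            for (prev, cur), n in pair_counts.items()}

tokens = "the cat sat on the mat the cat ran".split()
probs = fit_bigram(tokens)
```

Here "the" occurs three times as a context and is followed by "cat" twice, so the estimate for that pair is 2/3.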


The Order of Decoding


You may be wondering why we represented the factorization of a text sequence $P(x_1, \ldots, x_T)$
as a left-to-right chain of conditional probabilities. Why not right-to-left or some other,
seemingly random order? In principle, there is nothing wrong with unfolding $P(x_1, \ldots, x_T)$
in reverse order. The result is a valid factorization:


$$P(x_1, \ldots, x_T) = P(x_T) \prod_{t=T-1}^{1} P(x_t \mid x_{t+1}, \ldots, x_T). \tag{9.1.5}$$

However, there are many reasons why factorizing text in the same direction in which we 
read it (left-to-right for most languages, but right-to-left for Arabic and Hebrew) is preferred 
for the task of language modeling. First, this is just a more natural direction for us to think 
about. After all we all read text every day, and this process is guided by our ability to 
anticipate which words and phrases are likely to come next. Just think of how many times 
you have completed someone else’s sentence. Thus, even if we had no other reason to prefer 
such in-order decodings, they would be useful if only because we have better intuitions for 
what should be likely when predicting in this order. 


Second, by factorizing in order, we can assign probabilities to arbitrarily long sequences 
using the same language model. To convert a probability over steps 1 through $t$ into one that
extends to word $t+1$ we simply multiply by the conditional probability of the additional token
given the previous ones: $P(x_{t+1}, \ldots, x_1) = P(x_t, \ldots, x_1) \cdot P(x_{t+1} \mid x_t, \ldots, x_1)$.


Third, we have stronger predictive models for predicting adjacent words than words at ar- 
bitrary other locations. While all orders of factorization are valid, they do not necessarily 
all represent equally easy predictive modeling problems. This is true not only for language, 
but for other kinds of data as well, e.g., when the data is causally structured. For example, 
we believe that future events cannot influence the past. Hence, if we change $x_t$, we may be
able to influence what happens for $x_{t+1}$ going forward but not the converse. That is, if we
change $x_t$, the distribution over past events will not change. In some contexts, this makes
it easier to predict $P(x_{t+1} \mid x_t)$ than to predict $P(x_t \mid x_{t+1})$. For instance, in some cases,
we can find $x_{t+1} = f(x_t) + \epsilon$ for some additive noise $\epsilon$, whereas the converse is not true
(Hoyer et al., 2009). This is great news, since it is typically the forward direction that we 
are interested in estimating. The book by Peters et al. (2017) contains more on this topic. 
We barely scratch the surface of it. 


9.1.3 Training 


Before we focus our attention on text data, let’s first try this out with some continuous- 
valued synthetic data. 


Here, our 1000 synthetic data points will follow the trigonometric sin function, applied to 0.01
times the time step. To make the problem a little more interesting, we corrupt each sample
with additive noise. From this sequence we extract training examples, each consisting of
features and a label.


class Data(d2l.DataModule):
    def __init__(self, batch_size=16, T=1000, num_train=600, tau=4):
        self.save_hyperparameters()
        self.time = torch.arange(1, T + 1, dtype=torch.float32)
        self.x = torch.sin(0.01 * self.time) + torch.randn(T) * 0.2

data = Data()
d2l.plot(data.time, data.x, 'time', 'x', xlim=[1, 1000], figsize=(6, 3))


[Figure: the noisy sine sequence x plotted against time.]


To begin, we try a model that acts as if the data satisfied a $\tau$th-order Markov condition,
and thus predicts $x_t$ using only the past $\tau$ observations. Thus for each time step we have an
example with label $y = x_t$ and features $\mathbf{x}_t = [x_{t-\tau}, \ldots, x_{t-1}]$. The astute reader might have
noticed that this results in $1000 - \tau$ examples, since we lack sufficient history for $y_1, \ldots, y_\tau$.
While we could pad the first $\tau$ sequences with zeros, to keep things simple, we drop them
for now. The resulting dataset contains $T - \tau$ examples, where each input to the model has
sequence length $\tau$. We create a data iterator on the first 600 examples, covering a period of
the sin function.


@d2l.add_to_class(Data)
def get_dataloader(self, train):
    features = [self.x[i : self.T-self.tau+i] for i in range(self.tau)]
    self.features = torch.stack(features, 1)
    self.labels = self.x[self.tau:].reshape((-1, 1))
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader([self.features, self.labels], train, i)


In this example our model will be a standard linear regression. 


model = d2l.LinearRegression(lr=0.01)
trainer = d2l.Trainer(max_epochs=5)
trainer.fit(model, data)

[Figure: train_loss and val_loss curves from training.]


9.1.4 Prediction 


To evaluate our model, we first check how well it performs at one-step-ahead prediction. 


onestep_preds = model(data.features).detach().numpy()
d2l.plot(data.time[data.tau:], [data.labels, onestep_preds], 'time', 'x',
         legend=['labels', '1-step preds'], figsize=(6, 3))


[Figure: labels and 1-step predictions plotted against time.]

These predictions look good, even near the end at $t = 1000$.


But what if we only observed sequence data up until time step 604 (num_train + tau) and
wished to make predictions several steps into the future? Unfortunately, we cannot directly
compute the one-step-ahead prediction for time step 609, because we do not know the corresponding
inputs, having seen only up to $x_{604}$. We can address this problem by plugging in
our earlier predictions as inputs to our model for making subsequent predictions, projecting 
forward, one step at a time, until reaching the desired time step: 
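
$$\begin{aligned}
\hat{x}_{605} &= f(x_{601}, x_{602}, x_{603}, x_{604}), \\
\hat{x}_{606} &= f(x_{602}, x_{603}, x_{604}, \hat{x}_{605}), \\
\hat{x}_{607} &= f(x_{603}, x_{604}, \hat{x}_{605}, \hat{x}_{606}), \\
\hat{x}_{608} &= f(x_{604}, \hat{x}_{605}, \hat{x}_{606}, \hat{x}_{607}), \\
\hat{x}_{609} &= f(\hat{x}_{605}, \hat{x}_{606}, \hat{x}_{607}, \hat{x}_{608}), \\
&\vdots
\end{aligned} \tag{9.1.6}$$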


Generally, for an observed sequence $x_1, \ldots, x_t$, its predicted output $\hat{x}_{t+k}$ at time step $t+k$
is called the $k$-step-ahead prediction. Since we have observed up to $x_{604}$, its $k$-step-ahead
prediction is $\hat{x}_{604+k}$. In other words, we will have to keep on using our own predictions to
make multistep-ahead predictions. Let’s see how well this goes.


multistep_preds = torch.zeros(data.T)
multistep_preds[:] = data.x
for i in range(data.num_train + data.tau, data.T):
    multistep_preds[i] = model(
        multistep_preds[i - data.tau:i].reshape((1, -1)))
multistep_preds = multistep_preds.detach().numpy()

d2l.plot([data.time[data.tau:], data.time[data.num_train+data.tau:]],
         [onestep_preds, multistep_preds[data.num_train+data.tau:]], 'time',
         'x', legend=['1-step preds', 'multistep preds'], figsize=(6, 3))

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Unfortunately, in this case we fail spectacularly. The predictions decay to a constant pretty 
quickly after a few steps. Why did the algorithm perform so much worse when predicting 
further into the future? Ultimately, this is down to the fact that errors build up. Let’s say
that after step 1 we have some error $\epsilon_1 = \bar\epsilon$. Now the input for step 2 is perturbed by $\epsilon_1$,
hence we suffer some error in the order of $\epsilon_2 = \bar\epsilon + c \epsilon_1$ for some constant $c$, and so on. The
predictions can diverge rapidly from the true observations. You may already be familiar 
with this common phenomenon. For instance, weather forecasts for the next 24 hours tend 
to be pretty accurate but beyond that, accuracy declines rapidly. We will discuss methods 
for improving this throughout this chapter and beyond. 
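The error build-up argument can be turned into a small worked calculation. The sketch below iterates the heuristic recursion in which each step adds a fresh per-step error and carries over the previous error scaled by a constant c. This is a simplified model of the argument in the text, not an analysis of the trained network; the function name and the numeric values are assumptions chosen for illustration.

```python
def accumulated_error(eps_bar, c, k):
    """Iterate error_j = eps_bar + c * error_{j-1} for j = 1..k, with error_0 = 0."""
    error = 0.0
    history = []
    for _ in range(k):
        error = eps_bar + c * error
        history.append(error)
    return history

# Hypothetical values: per-step error 0.1 and amplification constant c = 1.5
errors = accumulated_error(eps_bar=0.1, c=1.5, k=5)
```

Under this recursion the error after k steps is a geometric sum: for c < 1 it levels off near eps_bar / (1 - c), while for c > 1 it grows geometrically with k.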


Let’s take a closer look at the difficulties in $k$-step-ahead predictions by computing predictions
on the entire sequence for $k = 1, 4, 16, 64$.


def k_step_pred(k):
    features = []
    for i in range(data.tau):
        features.append(data.x[i : i+data.T-data.tau-k+1])
    # The (i+tau)-th element stores the (i+1)-step-ahead predictions
    for i in range(k):
        preds = model(torch.stack(features[i : i+data.tau], 1))
        features.append(preds.reshape(-1))
    return features[data.tau:]

steps = (1, 4, 16, 64)
preds = k_step_pred(steps[-1])
d2l.plot(data.time[data.tau+steps[-1]-1:],
         [preds[k - 1].detach().numpy() for k in steps], 'time', 'x',
         legend=[f'{k}-step preds' for k in steps], figsize=(6, 3))

[Figure: 1-step, 4-step, 16-step, and 64-step predictions plotted against time.]

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This clearly illustrates how the quality of the prediction changes as we try to predict further 
into the future. While the 4-step-ahead predictions still look good, anything beyond that is 
almost useless. 


9.1.5 Summary 


There is quite a difference in difficulty between interpolation and extrapolation. Conse- 
quently, if you have a sequence, always respect the temporal order of the data when training, 
i.e., never train on future data. Given this kind of data, sequence models require specialized 
statistical tools for estimation. Two popular choices are autoregressive models and
latent-variable autoregressive models. For causal models (e.g., time going forward), estimating
the forward direction is typically a lot easier than the reverse direction. For an observed
sequence up to time step $t$, its predicted output at time step $t+k$ is the $k$-step-ahead
prediction. As we predict further in time by increasing $k$, the errors accumulate and the quality
of the prediction degrades, often dramatically. 
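The advice to respect temporal order amounts to splitting by position rather than by random shuffling. A minimal sketch, assuming a hypothetical helper `chronological_split` and a toy series:

```python
def chronological_split(sequence, train_fraction=0.6):
    """Split so that every training point precedes every validation point in time."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    cut = int(len(sequence) * train_fraction)
    return sequence[:cut], sequence[cut:]

# Toy series whose values double as time indices
series = list(range(1000))
train, val = chronological_split(series, train_fraction=0.6)
```

A shuffled split would let the model see observations from after the validation points, which turns an extrapolation problem into a much easier interpolation one.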


9.1.6 Exercises 


1. Improve the model in the experiment of this section.
    1. Incorporate more than the past four observations? How many do you really need?
    2. How many past observations would you need if there was no noise? Hint: you can
       write sin and cos as a differential equation.
    3. Can you incorporate older observations while keeping the total number of features
       constant? Does this improve accuracy? Why?
    4. Change the neural network architecture and evaluate the performance. You may train
       the new model with more epochs. What do you observe?

2. An investor wants to find a good security to buy. They look at past returns to decide
which one is likely to do well. What could possibly go wrong with this strategy?

3. Does causality also apply to text? To what extent?

4. Give an example for when a latent autoregressive model might be needed to capture the
dynamics of the data.
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9.2 Converting Raw Text into Sequence Data 


Throughout this book, we will often work with text data represented as sequences of words, 
characters, or word pieces. To get going, we will need some basic tools for converting raw 
text into sequences of the appropriate form. Typical preprocessing pipelines execute the 
following steps: 


1. Load text as strings into memory. 
2. Split the strings into tokens (e.g., words or characters). 


3. Build a vocabulary dictionary to associate each vocabulary element with a numerical 
index. 


4. Convert the text into sequences of numerical indices. 
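The four steps above can be sketched end to end with the standard library alone. This is a simplified, word-level illustration rather than the classes developed in this section: `preprocess` and `build_vocab` are hypothetical helpers, and the raw string is an in-memory sample standing in for a file on disk.

```python
import re

def preprocess(raw):
    """Lowercase and replace every run of non-letters with a single space."""
    return re.sub('[^A-Za-z]+', ' ', raw).lower().strip()

def build_vocab(tokens):
    """Map each distinct token to an index, assigned in sorted order."""
    return {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# Step 1: text held in memory as a string (sample sentence for illustration)
raw = "The Time Traveller (for so it will be convenient to speak of him)"
# Step 2: split into word tokens
tokens = preprocess(raw).split()
# Step 3: build the vocabulary
vocab = build_vocab(tokens)
# Step 4: convert tokens to numerical indices
indices = [vocab[token] for token in tokens]

# Inverting the mapping recovers the token sequence
idx_to_token = {idx: token for token, idx in vocab.items()}
recovered = [idx_to_token[i] for i in indices]
```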


import collections
import random
import re
import torch
from d2l import torch as d2l


9.2.1 Reading the Dataset 


Here, we will work with H. G. Wells’ The Time Machine, a book containing just over
30,000 words. While real applications will typically involve significantly larger datasets, 
this is sufficient to demonstrate the preprocessing pipeline. The following _download 
method reads the raw text into a string. 


class TimeMachine(d2l.DataModule): #@save
    """The Time Machine dataset."""
    def _download(self):
        fname = d2l.download(d2l.DATA_URL + 'timemachine.txt', self.root,
                             '090b5e7e70c295757f55df93cb0a180b9691891a')
        with open(fname) as f:
            return f.read()

data = TimeMachine()
raw_text = data._download()
raw_text[:60]


'The Time Machine, by H. G. Wells [1898]\n\n\n\n\nI\n\n\nThe Time Tra'


For simplicity, we ignore punctuation and capitalization when preprocessing the raw text. 


@d2l.add_to_class(TimeMachine) #@save
def _preprocess(self, text):
    return re.sub('[^A-Za-z]+', ' ', text).lower()

text = data._preprocess(raw_text)
text[:60]


'the time machine by h g wells i the time traveller for so it'


9.2.2 Tokenization 


Tokens are the atomic (indivisible) units of text. Each time step corresponds to 1 token, 
but what precisely constitutes a token is a design choice. For example, we could represent 
the sentence “Baby needs a new pair of shoes” as a sequence of 7 words, where the set of
all words comprises a large vocabulary (typically tens or hundreds of thousands of words).
Or we could represent the same sentence as a much longer sequence of 30 characters,
using a much smaller vocabulary (standard ASCII has only 128 distinct characters). Below, we
tokenize our preprocessed text into a sequence of characters. 


@d2l.add_to_class(TimeMachine) #@save
def _tokenize(self, text):
    return list(text)

tokens = data._tokenize(text)
','.join(tokens[:30])


't,h,e, ,t,i,m,e, ,m,a,c,h,i,n,e, ,b,y, ,h, ,g, ,w,e,l,l,s, '
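The trade-off between word-level and character-level tokens is easy to check on the example sentence from this section. The snippet below is a small standalone sketch; the variable names are chosen for illustration only.

```python
sentence = "Baby needs a new pair of shoes"

# Word-level tokens: short sequence, but the vocabulary of all words is large
word_tokens = sentence.split()

# Character-level tokens: longer sequence, but a small vocabulary
char_tokens = list(sentence)

num_words = len(word_tokens)
num_chars = len(char_tokens)
distinct_chars = sorted(set(char_tokens))
```

The sentence yields 7 word tokens and 30 character tokens (spaces included), matching the counts quoted in the text.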


9.2.3 Vocabulary 


These tokens are still strings. However, the inputs to our models must ultimately consist of 
numerical inputs. Next, we introduce a class for constructing vocabularies, i.e., objects that 
associate each distinct token value with a unique index. First, we determine the set of unique 
tokens in our training corpus. We then assign a numerical index to each unique token. Rare 
vocabulary elements are often dropped for convenience. Whenever we encounter a token at 
training or test time that had not been previously seen or was dropped from the vocabulary, 
we represent it by a special “<unk>” token, signifying that this is an unknown value.

class Vocab: #@save
    """Vocabulary for text."""
    def __init__(self, tokens=[], min_freq=0, reserved_tokens=[]):
        # Flatten a 2D list if needed
        if tokens and isinstance(tokens[0], list):
            tokens = [token for line in tokens for token in line]
        # Count token frequencies
        counter = collections.Counter(tokens)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[1],
                                  reverse=True)
        # The list of unique tokens
        self.idx_to_token = list(sorted(set(['<unk>'] + reserved_tokens + [
            token for token, freq in self.token_freqs if freq >= min_freq])))
        self.token_to_idx = {token: idx
                             for idx, token in enumerate(self.idx_to_token)}

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if hasattr(indices, '__len__') and len(indices) > 1:
            return [self.idx_to_token[int(index)] for index in indices]
        return self.idx_to_token[indices]

    @property
    def unk(self):  # Index for the unknown token
        return self.token_to_idx['<unk>']


We now construct a vocabulary for our dataset, converting the sequence of strings into a 
list of numerical indices. Note that we have not lost any information and can easily convert 
our dataset back to its original (string) representation. 


vocab = Vocab(tokens)
indices = vocab[tokens[:10]]
print('indices:', indices)
print('words:', vocab.to_tokens(indices))


indices: [21, 9, 6, 0, 21, 10, 14, 6, 0, 14]
words: ['t', 'h', 'e', ' ', 't', 'i', 'm', 'e', ' ', 'm']
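The handling of rare and unseen tokens can be illustrated with a stripped-down, dictionary-based sketch. It is not the vocabulary class from this section: `build_vocab`, `encode`, and the toy sentences are assumptions made for the example, and only the frequency threshold and the unknown-token fallback are shown.

```python
from collections import Counter

def build_vocab(tokens, min_freq=1, unk='<unk>'):
    """Keep tokens seen at least min_freq times, plus the unknown token."""
    counts = Counter(tokens)
    kept = sorted({unk} | {tok for tok, n in counts.items() if n >= min_freq})
    return {tok: idx for idx, tok in enumerate(kept)}

def encode(tokens, vocab, unk='<unk>'):
    """Map tokens to indices, falling back to the unknown token's index."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]

train_tokens = "the cat sat on the mat the cat ran".split()
vocab = build_vocab(train_tokens, min_freq=2)

# "dog" was never seen; "sat" was seen once and dropped by the threshold
ids = encode("the dog sat".split(), vocab)
```

Both the unseen word and the dropped rare word map to the same index, so the model cannot tell them apart; that is the price of a smaller vocabulary.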


9.2.4 Putting It All Together 


Using the above classes and methods, we package everything into the following build 
method of the TimeMachine class, which returns corpus, a list of token indices, and vocab, 
the vocabulary of The Time Machine corpus. The modifications we made here are: (i) we
tokenize text into characters, not words, to simplify the training in later sections; (ii) corpus
is a single list, not a list of token lists, since each text line in The Time Machine dataset is
not necessarily a sentence or paragraph.


@d2l.add_to_class(TimeMachine) #@save
def build(self, raw_text, vocab=None):
    tokens = self._tokenize(self._preprocess(raw_text))
    if vocab is None: vocab = Vocab(tokens)
    corpus = [vocab[token] for token in tokens]
    return corpus, vocab

corpus, vocab = data.build(raw_text)
len(corpus), len(vocab)


(173428, 28) 


9.2.5 Exploratory Language Statistics 


Using the real corpus and the Vocab class defined over words, we can inspect basic statistics 
concerning word use in our corpus. Below, we construct a vocabulary from words used in 
The Time Machine and print the ten most frequently occurring of them. 


words = text.split()
vocab = Vocab(words)
vocab.token_freqs[:10]


[('the', 2261),
 ('i', 1267),
 ('and', 1245),
 ('of', 1155),
 ('a', 816),
 ('to', 695),
 ('was', 552),
 ('in', 541),
 ('that', 443),
 ('my', 440)]


Note that the ten most frequent words are not all that descriptive. You might even imagine 
that we might see a very similar list if we had chosen any book at random. Articles like 
“the” and “a”, pronouns like “i” and “my”, and prepositions like “of”, “to”, and “in” occur 
often because they serve common syntactic roles. Such words that are common but not 
particularly descriptive are often called stop words and, in previous generations of text 
classifiers based on so-called bag-of-words representations, they were most often filtered 
out. However, they carry meaning and it is not necessary to filter them out when working 
with modern RNN- and Transformer-based neural models. If you look further down the 
list, you will notice that word frequency decays quickly. The 10th most frequent word is
less than 1/5 as common as the most popular. Word frequency tends to follow a power law
distribution (specifically the Zipfian) as we go down the ranks. To get a better idea, we plot
the word frequency.


freqs = [freq for token, freq in vocab.token_freqs]
d2l.plot(freqs, xlabel='token: x', ylabel='frequency: n(x)',
         xscale='log', yscale='log')


[Figure: log-log plot of frequency n(x) against token x.]


After dealing with the first few words as exceptions, all the remaining words roughly follow 
a straight line on a log-log plot. This phenomenon is captured by Zipf’s law, which states 
that the frequency $n_i$ of the $i$th most frequent word is:


$$n_i \propto \frac{1}{i^\alpha}, \tag{9.2.1}$$

which is equivalent to

$$\log n_i = -\alpha \log i + c, \tag{9.2.2}$$


where $\alpha$ is the exponent that characterizes the distribution and $c$ is a constant. This should
already give us pause for thought if we want to model words by counting statistics. After 
all, we will significantly overestimate the frequency of the tail, also known as the infre- 
quent words. But what about the other word combinations, such as two consecutive words 
(bigrams), three consecutive words (trigrams), and beyond? Let’s see whether the bigram 
frequency behaves in the same manner as the single word (unigram) frequency. 


bigram_tokens = ['--'.join(pair) for pair in zip(words[:-1], words[1:])]
bigram_vocab = Vocab(bigram_tokens)
bigram_vocab.token_freqs[:10]


[('of--the', 309),
 ('in--the', 169),
 ('i--had', 130),
 ('i--was', 112),
 ('and--the', 109),
 ('the--time', 102),
 ('it--was', 99),
 ('to--the', 85),
 ('as--i', 78),
 ('of--a', 73)]


One thing is notable here. Out of the ten most frequent word pairs, nine are composed of
two stop words and only one is relevant to the actual book: “the time”. Furthermore, let’s
see whether the trigram frequency behaves in the same manner.


trigram_tokens = ['--'.join(triple) for triple in zip(
    words[:-2], words[1:-1], words[2:])]
trigram_vocab = Vocab(trigram_tokens)
trigram_vocab.token_freqs[:10]


[('the--time--traveller', 59),
 ('the--time--machine', 30),
 ('the--medical--man', 24),
 ('it--seemed--to', 16),
 ('it--was--a', 15),
 ('here--and--there', 15),
 ('seemed--to--me', 14),
 ('i--did--not', 14),
 ('i--saw--the', 13),
 ('i--began--to', 13)]


Now, let’s visualize the token frequency among these three models: unigrams, bigrams, 
and trigrams. 


bigram_freqs = [freq for token, freq in bigram_vocab.token_freqs]
trigram_freqs = [freq for token, freq in trigram_vocab.token_freqs]
d2l.plot([freqs, bigram_freqs, trigram_freqs], xlabel='token: x',
         ylabel='frequency: n(x)', xscale='log', yscale='log',
         legend=['unigram', 'bigram', 'trigram'])


This figure is quite exciting. First, beyond unigram words, sequences of words also appear
to be following Zipf’s law, albeit with a smaller exponent $\alpha$ in (9.2.1), depending on the
sequence length. Second, the number of distinct n-grams is not that large. This gives us
hope that there is quite a lot of structure in language. Third, many n-grams occur very
rarely. This makes certain methods unsuitable for language modeling and motivates the
use of deep learning models. We will discuss this in the next section.
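

One simple way to put a number on the exponent in (9.2.2) is an ordinary least-squares fit of log frequency against log rank. The sketch below is only an illustration under stated assumptions: the helper name fit_zipf_exponent and its skip argument are hypothetical, and the input is a synthetic list that follows the power law exactly rather than the counts from this corpus.

```python
import math

def fit_zipf_exponent(freqs, skip=0):
    """Least-squares fit of log n_i = -alpha * log i + c; returns (alpha, c).

    freqs must be sorted from most to least frequent. The first `skip`
    ranks are ignored (hypothetical knob for treating them as exceptions).
    """
    # Ranks are 1-based; zero counts are dropped because log(0) is undefined.
    pts = [(math.log(i), math.log(n))
           for i, n in enumerate(freqs, start=1) if i > skip and n > 0]
    k = len(pts)
    mean_x = sum(x for x, _ in pts) / k
    mean_y = sum(y for _, y in pts) / k
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    return -slope, mean_y - slope * mean_x

# Synthetic frequencies that follow n_i = 1000 / i exactly, i.e., alpha = 1.
synthetic = [1000 / i for i in range(1, 201)]
alpha, c = fit_zipf_exponent(synthetic)
print(f'alpha = {alpha:.3f}, c = {c:.3f}')
```

On real counts the fit is sensitive to how many of the top ranks and how much of the noisy tail are included, so the estimate should be read as a rough summary rather than a precise constant.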


9.2.6 Summary 


Text is among the most common forms of sequence data encountered in deep learning. 
Common choices for what constitutes a token are characters, words, and word pieces. To 
preprocess text, we usually (i) split text into tokens; (ii) build a vocabulary to map token 
strings to numerical indices; and (iii) convert text data into token indices for models to
manipulate. In practice, the frequency of words tends to follow Zipf’s law. This is true not
just for individual words (unigrams), but also for n-grams.


9.2.7 Exercises 


1. In the experiment of this section, tokenize text into words and vary the min_freq argu- 
ment value of the Vocab instance. Qualitatively characterize how changes in min_freq 
impact the size of the resulting vocabulary. 


2. Estimate the exponent of Zipfian distribution for unigrams, bigrams, and trigrams in this 
corpus. 


3. Find some other sources of data (download a standard machine learning dataset, pick
another public domain book, scrape a website, etc.). For each, tokenize the data at both
the word and character levels. How do the vocabulary sizes compare with The Time
Machine corpus at equivalent values of min_freq? Estimate the exponent of the Zipfian
distribution corresponding to the unigram and bigram distributions for these corpora.
How do they compare with the values that you observed for The Time Machine corpus?


9.3 Language Models


In Section 9.2, we saw how to map text sequences into tokens, where these tokens can be
viewed as a sequence of discrete observations such as words or characters. Assume that
the tokens in a text sequence of length $T$ are in turn $x_1, x_2, \ldots, x_T$. The goal of language
models is to estimate the joint probability of the whole sequence:


$$P(x_1, x_2, \ldots, x_T), \tag{9.3.1}$$


where statistical tools in Section 9.1 can be applied.


Language models are incredibly useful. For instance, an ideal language model should
generate natural text on its own, simply by drawing one token at a time
$x_t \sim P(x_t \mid x_{t-1}, \ldots, x_1)$. Quite unlike the monkey using a typewriter, all text emerging from such
a model would pass as natural language, e.g., English text. Furthermore, it would be
sufficient for generating a meaningful dialog, simply by conditioning the text on previous dialog
fragments. Clearly we are still very far from designing such a system, since it would need
to understand the text rather than just generate grammatically sensible content.


Nonetheless, language models are of great service even in their limited form. For instance, 
the phrases “to recognize speech” and “to wreck a nice beach” sound very similar. This can 
cause ambiguity in speech recognition, which is easily resolved through a language model 
that rejects the second translation as outlandish. Likewise, in a document summarization 
algorithm it is worthwhile knowing that “dog bites man” is much more frequent than “man
bites dog”, or that “I want to eat grandma” is a rather disturbing statement, whereas “I want
to eat, grandma” is much more benign.


import torch
from d2l import torch as d2l


9.3.1 Learning Language Models 


The obvious question is how we should model a document, or even a sequence of tokens. 
Suppose that we tokenize text data at the word level. Let’s start by applying basic probability 
rules: 


$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^T P(x_t \mid x_1, \ldots, x_{t-1}). \tag{9.3.2}$$


For example, the probability of a text sequence containing four words would be given 
as: 


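$$\begin{aligned}
&P(\textrm{deep}, \textrm{learning}, \textrm{is}, \textrm{fun}) \\
=\ &P(\textrm{deep}) P(\textrm{learning} \mid \textrm{deep}) P(\textrm{is} \mid \textrm{deep}, \textrm{learning}) P(\textrm{fun} \mid \textrm{deep}, \textrm{learning}, \textrm{is}).
\end{aligned} \tag{9.3.3}$$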


Markov Models and n-grams 


Among those sequence model analyses in Section 9.1, let’s apply Markov models to
language modeling. A distribution over sequences satisfies the Markov property of first order
if $P(x_{t+1} \mid x_t, \ldots, x_1) = P(x_{t+1} \mid x_t)$. Higher orders correspond to longer dependencies.
This leads to a number of approximations that we could apply to model a sequence:


$$\begin{aligned}
P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2) P(x_3) P(x_4),\\
P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_2) P(x_4 \mid x_3),\\
P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_1, x_2) P(x_4 \mid x_2, x_3).
\end{aligned} \tag{9.3.4}$$


The probability formulae that involve one, two, and three variables are typically referred 
to as unigram, bigram, and trigram models, respectively. In order to compute the language 
model, we need to calculate the probability of words and the conditional probability of 
a word given the previous few words. Note that such probabilities are language model 
parameters. 


Word Frequency 


Here, we assume that the training dataset is a large text corpus, such as all Wikipedia entries,
Project Gutenberg, and all text posted on the web. The probability of words can be
calculated from the relative word frequency of a given word in the training dataset. For
example, the estimate $\hat{P}(\textrm{deep})$ can be calculated as the probability of any sentence starting
with the word “deep”. A slightly less accurate approach would be to count all occurrences
of the word “deep” and divide by the total number of words in the corpus. This works
fairly well, particularly for frequent words. Moving on, we could attempt to estimate


$$\hat{P}(\textrm{learning} \mid \textrm{deep}) = \frac{n(\textrm{deep, learning})}{n(\textrm{deep})}, \tag{9.3.5}$$


where $n(x)$ and $n(x, x')$ are the number of occurrences of singletons and consecutive word
pairs, respectively. Unfortunately, estimating the probability of a word pair is somewhat 
more difficult, since the occurrences of “deep learning” are a lot less frequent. In particular, 
for some unusual word combinations it may be tricky to find enough occurrences to get 
accurate estimates. As suggested by the empirical results in Section 9.2.5, things take a 
turn for the worse for three-word combinations and beyond. There will be many plausible 
three-word combinations that we likely will not see in our dataset. Unless we provide some 
solution to assign such word combinations a nonzero count, we will not be able to use them 
in a language model. If the dataset is small or if the words are very rare, we might not find 
even a single one of them. 
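

To make the counting estimate in (9.3.5) concrete, here is a minimal sketch on a made-up toy sentence. The helper name bigram_mle and the toy text are assumptions for illustration only; the point is that a pair never seen in the data receives probability exactly zero.

```python
from collections import Counter

def bigram_mle(tokens):
    """Return a function estimating P(second | first) by relative counts."""
    unigram = Counter(tokens)                        # n(x)
    bigram = Counter(zip(tokens[:-1], tokens[1:]))   # n(x, x')
    def prob(first, second):
        if unigram[first] == 0:
            return 0.0   # the conditioning word itself was never observed
        return bigram[(first, second)] / unigram[first]
    return prob

# Hypothetical toy corpus, tokenized at the word level.
toy = ('deep learning is fun and deep learning is useful '
       'and deep work is hard').split()
p = bigram_mle(toy)
print(p('deep', 'learning'))   # 2 of the 3 occurrences of "deep" precede "learning"
print(p('deep', 'thinking'))   # a plausible but unseen pair
```

The second call returns zero even though “deep thinking” is perfectly reasonable English, which is the failure mode that motivates smoothing.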


Laplace Smoothing 


A common strategy is to perform some form of Laplace smoothing. The solution is to add
a small constant to all counts. Denote by $n$ the total number of words in the training set and
$m$ the number of unique words. This solution helps with singletons, e.g., via


$$\begin{aligned}
\hat{P}(x) & = \frac{n(x) + \epsilon_1/m}{n + \epsilon_1}, \\
\hat{P}(x' \mid x) & = \frac{n(x, x') + \epsilon_2 \hat{P}(x')}{n(x) + \epsilon_2}, \\
\hat{P}(x'' \mid x, x') & = \frac{n(x, x', x'') + \epsilon_3 \hat{P}(x'')}{n(x, x') + \epsilon_3}.
\end{aligned} \tag{9.3.6}$$


Here $\epsilon_1$, $\epsilon_2$, and $\epsilon_3$ are hyperparameters. Take $\epsilon_1$ as an example: when $\epsilon_1 = 0$, no smoothing
is applied; when $\epsilon_1$ approaches positive infinity, $\hat{P}(x)$ approaches the uniform probability
$1/m$. The above is a rather primitive variant of what other techniques can accomplish
(Wood et al., 2011).
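

The first two lines of (9.3.6) can be sketched directly with counters. The function names smoothed_unigram and smoothed_bigram and the toy token list below are assumptions made for this illustration; the code simply transcribes the formulas and is not an optimized implementation.

```python
from collections import Counter

def smoothed_unigram(tokens, eps1):
    """P(x) = (n(x) + eps1 / m) / (n + eps1)."""
    counts = Counter(tokens)
    n, m = len(tokens), len(counts)   # total tokens, unique tokens
    def prob(x):
        return (counts[x] + eps1 / m) / (n + eps1)
    return prob

def smoothed_bigram(tokens, eps1, eps2):
    """P(x' | x) = (n(x, x') + eps2 * P(x')) / (n(x) + eps2)."""
    counts = Counter(tokens)
    pairs = Counter(zip(tokens[:-1], tokens[1:]))
    p_uni = smoothed_unigram(tokens, eps1)
    def prob(x, x_next):
        return (pairs[(x, x_next)] + eps2 * p_uni(x_next)) / (counts[x] + eps2)
    return prob

# Hypothetical toy corpus.
toy = 'the time machine by h g wells the time traveller'.split()
vocab = sorted(set(toy))
p1 = smoothed_unigram(toy, eps1=1.0)
p2 = smoothed_bigram(toy, eps1=1.0, eps2=1.0)
print(p1('the'))                                  # smoothed unigram estimate
print(p2('machine', 'the'))                       # unseen pair, now nonzero
print(sum(p2('machine', w) for w in vocab))       # conditional distribution total
```

Setting eps1 to zero recovers the plain relative frequency, while a very large eps1 pushes every word toward 1/m, matching the two limits described above.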


Unfortunately, models like this get unwieldy rather quickly for the following reasons. First, 
as discussed in Section 9.2.5, many n-grams occur very rarely, making Laplace smoothing 
rather unsuitable for language modeling. Second, we need to store all counts. Third, this 
entirely ignores the meaning of the words. For instance, “cat” and “feline” should occur in
related contexts. It is quite difficult to adjust such models to additional contexts, whereas
deep learning-based language models are well suited to take this into account. Last, long
word sequences are almost certain to be novel, hence a model that simply counts the fre- 
quency of previously seen word sequences is bound to perform poorly there. Therefore, we 
focus on using neural networks for language modeling in the rest of the chapter. 


9.3.2 Perplexity


Next, let’s discuss how to measure the quality of the language model, which we
will then use to evaluate our models in the subsequent sections. One way is to check how
surprising the text is. A good language model is able to predict, with high accuracy, the
tokens that come next. Consider the following continuations of the phrase “It is raining”,
as proposed by different language models:


1. “It is raining outside” 
2. “It is raining banana tree” 
3. “It is raining piouw;kcj pwepoiut” 


In terms of quality, Example 1 is clearly the best. The words are sensible and logically co- 
herent. While it might not quite accurately reflect which word follows semantically (“in San 
Francisco” and “in winter” would have been perfectly reasonable extensions), the model is 
able to capture which kind of word follows. Example 2 is considerably worse by producing 
a nonsensical extension. Nonetheless, at least the model has learned how to spell words 
and some degree of correlation between words. Last, Example 3 indicates a poorly trained 
model that does not fit data properly. 


We might measure the quality of the model by computing the likelihood of the sequence. 
Unfortunately this is a number that is hard to understand and difficult to compare. After all, 
shorter sequences are much more likely to occur than the longer ones, hence evaluating the 
model on Tolstoy’s magnum opus War and Peace will inevitably produce a much smaller 
likelihood than, say, on Saint-Exupery’s novella The Little Prince. What is missing is the 
equivalent of an average. 


Information theory comes in handy here. We defined entropy, surprisal, and cross-entropy
when we introduced the softmax regression (Section 4.1.3). If we want to compress text,
we can ask about predicting the next token given the current set of tokens. A better language
model should allow us to predict the next token more accurately. Thus, it should allow us to
spend fewer bits in compressing the sequence. So we can measure it by the cross-entropy
loss averaged over all the $n$ tokens of a sequence:


$$\frac{1}{n} \sum_{t=1}^n -\log P(x_t \mid x_{t-1}, \ldots, x_1), \tag{9.3.7}$$


where $P$ is given by a language model and $x_t$ is the actual token observed at time step $t$ from
the sequence. This makes the performance on documents of different lengths comparable.
For historical reasons, scientists in natural language processing prefer to use a quantity
called perplexity. In a nutshell, it is the exponential of (9.3.7):


$$\exp\left(-\frac{1}{n} \sum_{t=1}^n \log P(x_t \mid x_{t-1}, \ldots, x_1)\right). \tag{9.3.8}$$

Perplexity can be best understood as the reciprocal of the geometric mean of the probabilities
that the model assigns to the observed tokens, that is, the effective number of choices that
we have when deciding which token to pick next. Let’s look at a number of cases:


• In the best case scenario, the model always perfectly estimates the probability of the
target token as 1. In this case the perplexity of the model is 1.


• In the worst case scenario, the model always predicts the probability of the target token
as 0. In this situation, the perplexity is positive infinity.


• At the baseline, the model predicts a uniform distribution over all the available tokens of
the vocabulary. In this case, the perplexity equals the number of unique tokens of the
vocabulary. In fact, if we were to store the sequence without any compression, this
would be the best we could do for encoding it. Hence, this provides a nontrivial upper
bound that any useful model must beat.
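

The three cases above can be checked numerically with a few lines of Python. This is a sketch under simple assumptions: the function perplexity takes the list of probabilities that some model assigned to the tokens that actually occurred, and the vocabulary size of 28 is an arbitrary illustrative value.

```python
import math

def perplexity(target_probs):
    """exp of the average negative log-probability of the observed tokens."""
    if any(p == 0 for p in target_probs):
        return math.inf   # log 0 diverges, so the average loss is infinite
    n = len(target_probs)
    return math.exp(-sum(math.log(p) for p in target_probs) / n)

vocab_size = 28   # assumed size, for illustration only
print(perplexity([1.0] * 5))              # perfect model
print(perplexity([1 / vocab_size] * 5))   # uniform model
print(perplexity([0.5, 0.0, 0.9]))        # one target given zero probability
```

Because the loss is averaged over tokens before exponentiating, the value does not depend on the sequence length when the per-token probabilities are the same.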


9.3.3 Partitioning Sequences 


We will design language models using neural networks and use perplexity to evaluate how 
good the model is at predicting the next token given the current set of tokens in text se- 
quences. Before introducing the model, let’s assume that it processes a minibatch of se- 
quences with predefined length at a time. Now the question is how to read minibatches of 
input sequences and target sequences at random. 


Suppose that the dataset takes the form of a sequence of $T$ token indices in corpus. We
will partition it into subsequences, where each subsequence has $n$ tokens (time steps). To
iterate over (almost) all the tokens of the entire dataset for each epoch and obtain all possible
length-$n$ subsequences, we can introduce randomness. More concretely, at the beginning
of each epoch, discard the first $d$ tokens, where $d \in [0, n)$ is uniformly sampled at random.
The rest of the sequence is then partitioned into $m = \lfloor (T-d)/n \rfloor$ subsequences. Denote by
$\mathbf{x}_t = [x_t, \ldots, x_{t+n-1}]$ the length-$n$ subsequence starting from token $x_t$ at time step $t$. The
resulting $m$ partitioned subsequences are $\mathbf{x}_d, \mathbf{x}_{d+n}, \ldots, \mathbf{x}_{d+n(m-1)}$. Each subsequence will
be used as an input sequence into the language model.


For language modeling, the goal is to predict the next token based on the tokens we have
seen so far; hence the targets (labels) are the original sequence, shifted by one token. The
target sequence for any input sequence $\mathbf{x}_t$ is $\mathbf{x}_{t+1}$ with length $n$.


Fig. 9.3.1: Obtaining five pairs of input sequences and target sequences from partitioned
length-5 subsequences.


Fig. 9.3.1 shows an example of obtaining five pairs of input sequences and target sequences 
with n = 5 and d = 2. 
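

The random-offset partitioning described above can be sketched in plain Python as follows. The function name partition and its rng argument are hypothetical, and one detail is an assumption of this sketch: the number of subsequences is computed as (T - d - 1) // n, reserving one extra token so that every target has the full length n.

```python
import random

def partition(corpus, n, rng=random):
    """Yield (input, target) pairs of length n after a random offset d in [0, n)."""
    d = rng.randrange(n)                  # number of leading tokens to discard
    m = (len(corpus) - d - 1) // n        # keep one token for the last target
    starts = [d + i * n for i in range(m)]
    rng.shuffle(starts)                   # visit the subsequences in random order
    for s in starts:
        # The target is the input shifted by one token.
        yield corpus[s:s + n], corpus[s + 1:s + n + 1]

text = 'the time machine by h g wells'
for x, y in partition(text, n=5, rng=random.Random(0)):
    print(repr(x), '->', repr(y))
```

Since the offset is redrawn on each call, iterating once per epoch exposes the model to differently aligned subsequences over the course of training.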


@d2l.add_to_class(d2l.TimeMachine)  #@save
def __init__(self, batch_size, num_steps, num_train=10000, num_val=5000):
    super(d2l.TimeMachine, self).__init__()
    self.save_hyperparameters()
    corpus, self.vocab = self.build(self._download())
    array = torch.tensor([corpus[i:i+num_steps+1]
                        for i in range(len(corpus)-num_steps)])
    self.X, self.Y = array[:,:-1], array[:,1:]


To train language models, we will randomly sample pairs of input sequences and target 
sequences in minibatches. The following data loader randomly generates a minibatch from 
the dataset each time. The argument batch_size specifies the number of subsequence 
examples in each minibatch and num_steps is the subsequence length in tokens. 


@d2l.add_to_class(d2l.TimeMachine)  #@save
def get_dataloader(self, train):
    idx = slice(0, self.num_train) if train else slice(
        self.num_train, self.num_train + self.num_val)
    return self.get_tensorloader([self.X, self.Y], train, idx)


As we can see in the following, a minibatch of target sequences can be obtained by shifting 
the input sequences by one token. 


data = d2l.TimeMachine(batch_size=2, num_steps=10)
for X, Y in data.train_dataloader():
    print('X:', X, '\nY:', Y)
    break


Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...
X: tensor([[10,  4,  2, 21, 10, 16, 15,  0, 20,  2],
        [21,  9,  6, 19,  0, 24,  2, 26,  0, 16]])
Y: tensor([[ 4,  2, 21, 10, 16, 15,  0, 20,  2, 10],
        [ 9,  6, 19,  0, 24,  2, 26,  0, 16,  9]])


9.3.4 Summary and Discussion 


Language models estimate the joint probability of a text sequence. For long sequences, 
n-grams provide a convenient model by truncating the dependence. However, there is a lot 
of structure but not enough frequency to deal efficiently with infrequent word combinations 
via Laplace smoothing. Thus, we will focus on neural language modeling in subsequent 
sections. To train language models, we can randomly sample pairs of input sequences 
and target sequences in minibatches. After training, we will use perplexity to measure the 
language model quality. 


Language models can be scaled up with increased data size, model size, and amount of
training compute. Large language models can perform desired tasks by predicting output
text given input text instructions. As we will discuss later (e.g., Section 11.9), at the present
moment large language models form the basis of state-of-the-art systems across diverse
tasks.


9.3.5 Exercises


1. Suppose there are 100,000 words in the training dataset. How much word frequency 
and multi-word adjacent frequency does a four-gram need to store? 


2. How would you model a dialogue? 
3. What other methods can you think of for reading long sequence data? 


4. Consider our method for discarding a uniformly random number of the first few tokens 
at the beginning of each epoch. 


1. Does it really lead to a perfectly uniform distribution over the sequences on the docu- 
ment? 


2. What would you have to do to make things even more uniform? 


5. If we want a sequence example to be a complete sentence, what kind of problem does
this introduce in minibatch sampling? How can we fix it?


9.4 Recurrent Neural Networks


In Section 9.3 we described Markov models and n-grams for language modeling, where
the conditional probability of token $x_t$ at time step $t$ only depends on the $n-1$ previous
tokens. If we want to incorporate the possible effect of tokens earlier than time step
$t-(n-1)$ on $x_t$, we need to increase $n$. However, the number of model parameters would also
increase exponentially with it, as we need to store $|\mathcal{V}|^n$ numbers for a vocabulary set $\mathcal{V}$.
Hence, rather than modeling $P(x_t \mid x_{t-1}, \ldots, x_{t-n+1})$ it is preferable to use a latent variable
model,


$$P(x_t \mid x_{t-1}, \ldots, x_1) \approx P(x_t \mid h_{t-1}), \tag{9.4.1}$$


where $h_{t-1}$ is a hidden state that stores the sequence information up to time step $t-1$. In
general, the hidden state at any time step $t$ could be computed based on both the current
input $x_t$ and the previous hidden state $h_{t-1}$:


$$h_t = f(x_t, h_{t-1}). \tag{9.4.2}$$


For a sufficiently powerful function $f$ in (9.4.2), the latent variable model is not an
approximation. After all, $h_t$ may simply store all the data it has observed so far. However, it could
potentially make both computation and storage expensive.


Recall that we have discussed hidden layers with hidden units in Chapter 5. It is noteworthy 
that hidden layers and hidden states refer to two very different concepts. Hidden layers are, 
as explained, layers that are hidden from view on the path from input to output. Hidden
states are technically speaking inputs to whatever we do at a given step, and they can only
be computed by looking at data at previous time steps.


Recurrent neural networks (RNNs) are neural networks with hidden states. Before intro- 
ducing the RNN model, we first revisit the MLP model introduced in Section 5.1. 


import torch
from d2l import torch as d2l


9.4.1 Neural Networks without Hidden States 


Let’s take a look at an MLP with a single hidden layer. Let the hidden layer’s activation
function be $\phi$. Given a minibatch of examples $\mathbf{X} \in \mathbb{R}^{n \times d}$ with batch size $n$ and $d$ inputs,
the hidden layer output $\mathbf{H} \in \mathbb{R}^{n \times h}$ is calculated as


$$\mathbf{H} = \phi(\mathbf{X} \mathbf{W}_{xh} + \mathbf{b}_h). \tag{9.4.3}$$


In (9.4.3), we have the weight parameter $\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}$, the bias parameter $\mathbf{b}_h \in \mathbb{R}^{1 \times h}$, and
the number of hidden units $h$, for the hidden layer. So armed, we apply broadcasting (see
Section 2.1.4) during the summation. Next, the hidden layer output $\mathbf{H}$ is used as input of
the output layer, which is given by


$$\mathbf{O} = \mathbf{H} \mathbf{W}_{hq} + \mathbf{b}_q, \tag{9.4.4}$$


where $\mathbf{O} \in \mathbb{R}^{n \times q}$ is the output variable, $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ is the weight parameter, and
$\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ is the bias parameter of the output layer. If it is a classification problem, we can use
softmax($\mathbf{O}$) to compute the probability distribution of the output categories.


This is entirely analogous to the regression problem we solved previously in Section 9.1, 
hence we omit details. Suffice it to say that we can pick feature-label pairs at random and 
learn the parameters of our network via automatic differentiation and stochastic gradient 
descent. 


9.4.2 Recurrent Neural Networks with Hidden States 


Matters are entirely different when we have hidden states. Let’s look at the structure in 
some more detail. 


Assume that we have a minibatch of inputs $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ at time step $t$. In other words, for
a minibatch of $n$ sequence examples, each row of $\mathbf{X}_t$ corresponds to one example at time
step $t$ from the sequence. Next, denote by $\mathbf{H}_t \in \mathbb{R}^{n \times h}$ the hidden layer output of time step
$t$. Unlike with MLP, here we save the hidden layer output $\mathbf{H}_{t-1}$ from the previous time step
and introduce a new weight parameter $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$ to describe how to use the hidden layer
output of the previous time step in the current time step. Specifically, the calculation of the
hidden layer output of the current time step is determined by the input of the current time
step together with the hidden layer output of the previous time step:


$$\mathbf{H}_t = \phi(\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh} + \mathbf{b}_h). \tag{9.4.5}$$


Compared with (9.4.3), (9.4.5) adds one more term $\mathbf{H}_{t-1} \mathbf{W}_{hh}$ and thus instantiates (9.4.2).
From the relationship between hidden layer outputs $\mathbf{H}_t$ and $\mathbf{H}_{t-1}$ of adjacent time steps,
we know that these variables captured and retained the sequence’s historical information
up to their current time step, just like the state or memory of the neural network’s current
time step. Therefore, such a hidden layer output is called a hidden state. Since the hidden
state uses the same definition of the previous time step in the current time step, the
computation of (9.4.5) is recurrent. Hence, as we said, neural networks with hidden states based
on recurrent computation are named recurrent neural networks. Layers that perform the
computation of (9.4.5) in RNNs are called recurrent layers.


There are many different ways for constructing RNNs. Those with a hidden state defined
by (9.4.5) are very common. For time step $t$, the output of the output layer is similar to the
computation in the MLP:


$$\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q. \tag{9.4.6}$$


Parameters of the RNN include the weights $\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}$, $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$, and the bias
$\mathbf{b}_h \in \mathbb{R}^{1 \times h}$ of the hidden layer, together with the weights $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ and the bias $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$
of the output layer. It is worth mentioning that even at different time steps, RNNs always
use these model parameters. Therefore, the parametrization cost of an RNN does not grow
as the number of time steps increases.
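

To see that the same parameters serve every time step, the recurrence can be written out with plain Python lists. This is only a sketch: the helper names matmul, rnn_forward, and rand are hypothetical, tanh is assumed as the activation function, and the sizes are arbitrary small values.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rnn_forward(inputs, W_xh, W_hh, b_h, W_hq, b_q):
    """Apply the hidden-state update and the output layer at each time step."""
    n, h = len(inputs[0]), len(b_h)
    H = [[0.0] * h for _ in range(n)]        # initial hidden state, all zeros
    outputs = []
    for X in inputs:                          # one (n x d) matrix per time step
        pre = [[a + b + c for a, b, c in zip(r1, r2, b_h)]
               for r1, r2 in zip(matmul(X, W_xh), matmul(H, W_hh))]
        H = [[math.tanh(v) for v in row] for row in pre]   # assumed activation
        outputs.append([[o + b for o, b in zip(row, b_q)]
                        for row in matmul(H, W_hq)])
    return outputs, H

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

n, d, h, q = 2, 3, 4, 5                       # illustrative sizes
params = (rand(d, h), rand(h, h), [0.0] * h, rand(h, q), [0.0] * q)
num_params = d * h + h * h + h + h * q + q    # does not involve the sequence length
for T in (3, 7):
    outs, H = rnn_forward([rand(n, d) for _ in range(T)], *params)
    print(T, len(outs), num_params)
```

Running the loop with sequences of length 3 and 7 reuses the identical params tuple, so only the number of outputs changes, not the parameter count.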


Fig. 9.4.1 illustrates the computational logic of an RNN at three adjacent time steps. At
any time step $t$, the computation of the hidden state can be treated as: (i) concatenating the
input $\mathbf{X}_t$ at the current time step $t$ and the hidden state $\mathbf{H}_{t-1}$ at the previous time step $t-1$;
(ii) feeding the concatenation result into a fully connected layer with the activation function
$\phi$. The output of such a fully connected layer is the hidden state $\mathbf{H}_t$ of the current time step
$t$. In this case, the model parameters are the concatenation of $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$, and a bias
of $\mathbf{b}_h$, all from (9.4.5). The hidden state of the current time step $t$, $\mathbf{H}_t$, will participate in
computing the hidden state $\mathbf{H}_{t+1}$ of the next time step $t+1$. What is more, $\mathbf{H}_t$ will also be
fed into the fully connected output layer to compute the output $\mathbf{O}_t$ of the current time step
$t$.


Fig. 9.4.1: An RNN with a hidden state.


We just mentioned that the calculation of $\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh}$ for the hidden state is
equivalent to matrix multiplication of the concatenation of $\mathbf{X}_t$ and $\mathbf{H}_{t-1}$ and the concatenation
of $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$. Though this can be proven mathematically, in the following we just use
a simple code snippet as a demonstration. To begin with, we define matrices X, W_xh, H,
and W_hh, whose shapes are (3, 1), (1, 4), (3, 4), and (4, 4), respectively. Multiplying X by
W_xh, and H by W_hh, and then adding these two products, we obtain a matrix of shape (3,
4).


X, W_xh = torch.randn(3, 1), torch.randn(1, 4) 
H, W_hh = torch.randn(3, 4), torch.randn(4, 4) 
torch.matmul(X, W_xh) + torch.matmul(H, W_hh) 


tensor([[ 1.2526, 0.0580, -3.3460, -0.2519], 
[-1.3064, 1.4132, -0.1435, 0.3482], 
[ 3.1495, 0.8172, 1.5167, -0.9038]]) 


Now we concatenate the matrices X and H along columns (axis 1), and the matrices W_xh 
and W_hh along rows (axis 0). These two concatenations result in matrices of shape (3, 5) 
and of shape (5, 4), respectively. Multiplying these two concatenated matrices, we obtain 
the same output matrix of shape (3, 4) as above. 


torch.matmul(torch.cat((X, H), 1), torch.cat((W_xh, W_hh), 0))


tensor([[ 1.2526,  0.0580, -3.3460, -0.2519],
        [-1.3064,  1.4132, -0.1435,  0.3482],
        [ 3.1495,  0.8172,  1.5167, -0.9038]])


9.4.3 RNN-Based Character-Level Language Models 


Recall that for language modeling in Section 9.3, we aim to predict the next token based on 
the current and past tokens; thus we shift the original sequence by one token as the targets 
(labels). Bengio et al. (2003) first proposed to use a neural network for language modeling. 
In the following we illustrate how RNNs can be used to build a language model. Let the 
minibatch size be one, and the sequence of the text be “machine”. To simplify training 
in subsequent sections, we tokenize text into characters rather than words and consider a 
character-level language model. Fig. 9.4.2 demonstrates how to predict the next charac- 
ter based on the current and previous characters via an RNN for character-level language 
modeling. 


During the training process, we run a softmax operation on the output from the output layer
for each time step, and then use the cross-entropy loss to compute the error between the
model output and the target. Because of the recurrent computation of the hidden state in the
hidden layer, the output, $\mathbf{O}_3$, of time step 3 in Fig. 9.4.2 is determined by the text sequence
“m”, “a”, and “c”. Since the next character of the sequence in the training data is “h”, the
loss of time step 3 will depend on the probability distribution of the next character generated
based on the feature sequence “m”, “a”, “c” and the target “h” of this time step.
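

The loss at a single time step can be illustrated with a few lines of Python. In this sketch the seven-character vocabulary and the output scores o3 are made up for illustration; they are not the outputs of any trained model.

```python
import math

def softmax(logits):
    """Turn raw output scores into a probability distribution."""
    shift = max(logits)                       # subtract the max for stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the target token."""
    return -math.log(softmax(logits)[target_index])

vocab = ['a', 'c', 'e', 'h', 'i', 'm', 'n']   # characters of "machine"
o3 = [0.1, -0.3, 0.0, 2.0, 0.2, -1.0, 0.4]    # hypothetical output at time step 3
loss = cross_entropy(o3, vocab.index('h'))    # the target at time step 3 is "h"
print(f'loss at time step 3: {loss:.4f}')
print(f'uniform baseline:    {math.log(len(vocab)):.4f}')
```

Averaging such per-step losses over all time steps and exponentiating the result gives the perplexity used to evaluate the model.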


Fig. 9.4.2: A character-level language model based on the RNN. The input and target sequences are
“machin” and “achine”, respectively.


In practice, each token is represented by a $d$-dimensional vector, and we use a batch size
$n > 1$. Therefore, the input $\mathbf{X}_t$ at time step $t$ will be an $n \times d$ matrix, which is identical to
what we discussed in Section 9.4.2.


In the following sections, we will implement RNNs for character-level language models.


9.4.4 Summary 


A neural network that uses recurrent computation for hidden states is called a recurrent 
neural network (RNN). The hidden state of an RNN can capture historical information of 
the sequence up to the current time step. With recurrent computation, the number of RNN 
model parameters does not grow as the number of time steps increases. As for applications, 
an RNN can be used to create character-level language models. 


9.4.5 Exercises 


1. If we use an RNN to predict the next character in a text sequence, what is the required 
dimension for any output? 


2. Why can RNNs express the conditional probability of a token at some time step based 
on all the previous tokens in the text sequence? 


3. What happens to the gradient if you backpropagate through a long sequence? 


4. What are some of the problems associated with the language model described in this
section?


9.5 Recurrent Neural Network Implementation from Scratch


We are now ready to implement an RNN from scratch. In particular, we will train this
RNN to function as a character-level language model (see Section 9.4) and train it on a
corpus consisting of the entire text of H. G. Wells’ The Time Machine, following the data
processing steps outlined in Section 9.2. We start by loading the dataset.


%matplotlib inline
import math
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


9.5.1 RNN Model 


We begin by defining a class to implement the RNN model (Section 9.4.2). Note that the 
number of hidden units num_hiddens is a tunable hyperparameter. 


class RNNScratch(d2l.Module):  #@save
    """The RNN model implemented from scratch."""
    def __init__(self, num_inputs, num_hiddens, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.W_xh = nn.Parameter(
            torch.randn(num_inputs, num_hiddens) * sigma)
        self.W_hh = nn.Parameter(
            torch.randn(num_hiddens, num_hiddens) * sigma)
        self.b_h = nn.Parameter(torch.zeros(num_hiddens))


The forward method below defines how to compute the output and hidden state at any time 
step, given the current input and the state of the model at the previous time step. Note that 
the RNN model loops through the outermost dimension of inputs, updating the hidden 
state one time step at a time. The model here uses a tanh activation function (Section 
5.1.2). 


@d2l.add_to_class(RNNScratch)  #@save
def forward(self, inputs, state=None):
    if state is None:
        # Initial state with shape: (batch_size, num_hiddens)
        state = torch.zeros((inputs.shape[1], self.num_hiddens),
                          device=inputs.device)
    else:
        state, = state
    outputs = []
    for X in inputs:  # Shape of inputs: (num_steps, batch_size, num_inputs)
        state = torch.tanh(torch.matmul(X, self.W_xh) +
                         torch.matmul(state, self.W_hh) + self.b_h)
        outputs.append(state)
    return outputs, state


We can feed a minibatch of input sequences into an RNN model as follows. 


batch_size, num_inputs, num_hiddens, num_steps = 2, 16, 32, 100
rnn = RNNScratch(num_inputs, num_hiddens)
X = torch.ones((num_steps, batch_size, num_inputs))
outputs, state = rnn(X)


Let’s check whether the RNN model produces results of the correct shapes to ensure that 
the dimensionality of the hidden state remains unchanged. 


def check_len(a, n):  #@save
    """Check the length of a list."""
    assert len(a) == n, f'list\'s length {len(a)} != expected length {n}'

def check_shape(a, shape):  #@save
    """Check the shape of a tensor."""
    assert a.shape == shape, \
            f'tensor\'s shape {a.shape} != expected shape {shape}'

check_len(outputs, num_steps)
check_shape(outputs[0], (batch_size, num_hiddens))
check_shape(state, (batch_size, num_hiddens))


9.5.2 RNN-Based Language Model 


The following RNNLMScratch class defines an RNN-based language model, where we pass 
in our RNN via the rnn argument of the __init__ method. When training language mod- 
els, the inputs and outputs are from the same vocabulary. Hence, they have the same di- 
mension, which is equal to the vocabulary size. Note that we use perplexity to evaluate the 
model. As discussed in Section 9.3.2, this ensures that sequences of different length are 
comparable. 
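
To make the perplexity computation concrete, here is a minimal sketch. It assumes that
the loss is the mean cross-entropy over all predicted tokens, so that perplexity is simply
its exponential. The vocabulary size and tensor shapes are illustrative values, not taken
from the dataset. A model that predicts a uniform distribution over the vocabulary should
have a perplexity equal to the vocabulary size.

```python
import torch
from torch.nn import functional as F

vocab_size = 28  # illustrative vocabulary size
batch_size, num_steps = 2, 5  # illustrative minibatch shape

# All-zero logits correspond to a uniform distribution over the vocabulary
logits = torch.zeros(batch_size, num_steps, vocab_size)
targets = torch.randint(0, vocab_size, (batch_size, num_steps))

# Mean cross-entropy over all (sequence, time step) positions
ce = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
ppl = torch.exp(ce)  # perplexity is the exponential of the mean loss
print(ppl.item())  # approximately 28, the vocabulary size
```

Because the loss is averaged over positions before exponentiation, the value does not
grow with the sequence length, which is what makes sequences of different lengths
comparable.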


class RNNLMScratch(d2l.Classifier):  #@save
    """The RNN-based language model implemented from scratch."""
    def __init__(self, rnn, vocab_size, lr=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.init_params()

    def init_params(self):
        self.W_hq = nn.Parameter(
            torch.randn(
                self.rnn.num_hiddens, self.vocab_size) * self.rnn.sigma)
        self.b_q = nn.Parameter(torch.zeros(self.vocab_size))

    def training_step(self, batch):
        l = self.loss(self(*batch[:-1]), batch[-1])
        self.plot('ppl', torch.exp(l), train=True)
        return l

    def validation_step(self, batch):
        l = self.loss(self(*batch[:-1]), batch[-1])
        self.plot('ppl', torch.exp(l), train=False)


One-Hot Encoding


Recall that each token is represented by a numerical index indicating the position in the
vocabulary of the corresponding word/character/word piece. You might be tempted to build
a neural network with a single input node (at each time step), where the index could be fed
in as a scalar value. This works when we are dealing with numerical inputs like price or
temperature, where any two values sufficiently close together should be treated similarly.
But it does not make sense for token indices. The 45th and 46th words in our vocabulary
happen to be “their” and “said”, whose meanings are not remotely similar.


When dealing with such categorical data, the most common strategy is to represent each 
item by a one-hot encoding (recall from Section 4.1.1). A one-hot encoding is a vector 
whose length is given by the size of the vocabulary N, where all entries are set to 0, except 
for the entry corresponding to our token, which is set to 1. For example, if the vocabulary 
had five elements, then the one-hot vectors corresponding to indices 0 and 2 would be the 
following. 


F.one_hot(torch.tensor([0, 2]), 5)


tensor([[1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0]])


The minibatches that we sample at each iteration will take the shape (batch size, number
of time steps). Once we represent each input as a one-hot vector, we can think of each
minibatch as a three-dimensional tensor, where the length along the third axis is given by
the vocabulary size (len(vocab)). We often transpose the input so that we will obtain an
output of shape (number of time steps, batch size, vocabulary size). This will allow us to
loop more conveniently through the outermost dimension for updating hidden states of a
minibatch, time step by time step (e.g., in the above forward method).


@d2l.add_to_class(RNNLMScratch)  #@save
def one_hot(self, X):
    # Output shape: (num_steps, batch_size, vocab_size)
    return F.one_hot(X.T, self.vocab_size).type(torch.float32)
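
As a small illustration of the shapes involved, the following sketch applies the same
transposition and one-hot encoding to a hand-written minibatch using plain PyTorch. The
vocabulary size, the token indices, and the weight matrix W are made up for illustration.
It also shows that multiplying a one-hot vector by a weight matrix just selects one row
of that matrix.

```python
import torch
from torch.nn import functional as F

vocab_size = 5  # illustrative vocabulary size
X = torch.tensor([[0, 2, 4],
                  [1, 1, 3]])  # token indices, shape (batch_size=2, num_steps=3)

# Transpose first so that time steps form the outermost dimension
embs = F.one_hot(X.T, vocab_size).type(torch.float32)
print(embs.shape)  # torch.Size([3, 2, 5])

# Multiplying a one-hot vector by a weight matrix selects one row of it
W = torch.arange(15, dtype=torch.float32).reshape(vocab_size, 3)
first_step = embs[0] @ W  # inputs at time step 0 for both sequences
print(torch.equal(first_step, W[X[:, 0]]))  # True
```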


Transforming RNN Outputs 


The language model uses a fully connected output layer to transform RNN outputs into 
token predictions at each time step. 


@d2l.add_to_class(RNNLMScratch)  #@save
def output_layer(self, rnn_outputs):
    outputs = [torch.matmul(H, self.W_hq) + self.b_q for H in rnn_outputs]
    return torch.stack(outputs, 1)

@d2l.add_to_class(RNNLMScratch)  #@save
def forward(self, X, state=None):
    embs = self.one_hot(X)
    rnn_outputs, _ = self.rnn(embs, state)
    return self.output_layer(rnn_outputs)


Let’s check whether the forward computation produces outputs with the correct shape. 


model = RNNLMScratch(rnn, num_inputs)
outputs = model(torch.ones((batch_size, num_steps), dtype=torch.int64))
check_shape(outputs, (batch_size, num_steps, num_inputs))


9.5.3 Gradient Clipping 


While you are already used to thinking of neural networks as “deep” in the sense that many
layers separate the input and output even within a single time step, the length of the
sequence introduces a new notion of depth. In addition to passing through the network
in the input-to-output direction, inputs at the first time step must pass through a chain of
$T$ layers along the time steps in order to influence the output of the model at the final
time step. Taking the backwards view, in each iteration, we backpropagate gradients through
time, resulting in a chain of matrix-products of length $\mathcal{O}(T)$. As mentioned in
Section 5.4, this can result in numerical instability, causing the gradients either to
explode or vanish, depending on the properties of the weight matrices.


Dealing with vanishing and exploding gradients is a fundamental problem when designing 
RNNs and has inspired some of the biggest advances in modern neural network architec- 
tures. In the next chapter, we will talk about specialized architectures that were designed in 
hopes of mitigating the vanishing gradient problem. However, even modern RNNs often 
suffer from exploding gradients. One inelegant but ubiquitous solution is to simply clip the
gradients, forcing the resulting “clipped” gradients to take smaller values.


Generally speaking, when optimizing some objective by gradient descent, we iteratively
update the parameter of interest, say a vector $\mathbf{x}$, by pushing it in the direction
of the negative gradient $\mathbf{g}$ (in stochastic gradient descent, we calculate this
gradient on a randomly sampled minibatch). For example, with learning rate $\eta > 0$, each
update takes the form $\mathbf{x} \gets \mathbf{x} - \eta \mathbf{g}$. Let’s further assume
that the objective function $f$ is sufficiently smooth. Formally, we say that the objective
is Lipschitz continuous with constant $L$, meaning that for any $\mathbf{x}$ and
$\mathbf{y}$, we have


$$|f(\mathbf{x}) - f(\mathbf{y})| \leq L \|\mathbf{x} - \mathbf{y}\|. \tag{9.5.1}$$


As you can see, when we update the parameter vector by subtracting $\eta \mathbf{g}$, the
change in the value of the objective depends on the learning rate, the norm of the gradient
and $L$ as follows:


$$|f(\mathbf{x}) - f(\mathbf{x} - \eta \mathbf{g})| \leq L \eta \|\mathbf{g}\|. \tag{9.5.2}$$


In other words, the objective cannot change by more than $L \eta \|\mathbf{g}\|$. Having a
small value for this upper bound might be viewed as good or bad. On the downside, we are
limiting the speed at which we can reduce the value of the objective. On the bright side,
this limits just how much we can go wrong in any one gradient step.


When we say that gradients explode, we mean that $\|\mathbf{g}\|$ becomes excessively large. In this
worst case, we might do so much damage in a single gradient step that we could undo all 
of the progress made over the course of thousands of training iterations. When gradients 
can be so large, neural network training often diverges, failing to reduce the value of the 
objective. At other times, training eventually converges but is unstable owing to massive 
spikes in the loss. 


One way to limit the size of $L \eta \|\mathbf{g}\|$ is to shrink the learning rate $\eta$
to tiny values. This has the advantage that we do not bias the updates. But what if we only
rarely get large gradients? This drastic move slows down our progress at all steps, just to
deal with the rare exploding gradient events. A popular alternative is to adopt a gradient
clipping heuristic, projecting the gradients $\mathbf{g}$ onto a ball of some given radius
$\theta$ as follows:


$$\mathbf{g} \leftarrow \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right) \mathbf{g}. \tag{9.5.3}$$


This ensures that the gradient norm never exceeds $\theta$ and that the updated gradient is entirely
aligned with the original direction of g. It also has the desirable side-effect of limiting the 
influence any given minibatch (and within it any given sample) can exert on the parameter 
vector. This bestows a certain degree of robustness to the model. To be clear, it is a hack. 
Gradient clipping means that we are not always following the true gradient and it is hard to 
reason analytically about the possible side effects. However, it is a very useful hack, and is 
widely adopted in RNN implementations in most deep learning frameworks. 


Below we define a method to clip gradients, which is invoked by the fit_epoch method
of the d2l.Trainer class (see Section 3.4). Note that when computing the gradient norm,
we are concatenating all model parameters, treating them as a single giant parameter
vector.


@d2l.add_to_class(d2l.Trainer)  #@save
def clip_gradients(self, grad_clip_val, model):
    params = [p for p in model.parameters() if p.requires_grad]
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    if norm > grad_clip_val:
        for param in params:
            param.grad[:] *= grad_clip_val / norm
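
The effect of this rescaling can be checked in isolation. The sketch below is a standalone
variant that does not depend on the d2l package: the function name clip_by_global_norm,
the tiny linear model, and the deliberately inflated loss are all hypothetical choices made
for illustration. It applies the same global-norm rule and confirms that the joint gradient
norm equals the clipping radius afterwards.

```python
import torch
from torch import nn

def clip_by_global_norm(params, theta):
    """Rescale gradients in place so that their joint L2 norm is at most theta."""
    params = [p for p in params if p.grad is not None]
    # Treat all gradients as one long vector when computing the norm
    norm = torch.sqrt(sum(torch.sum(p.grad ** 2) for p in params))
    if norm > theta:
        for p in params:
            p.grad.mul_(theta / norm)
    return norm  # the norm before clipping

torch.manual_seed(0)
net = nn.Linear(4, 3)  # hypothetical toy model
loss = (100 * net(torch.randn(8, 4))).pow(2).sum()  # deliberately large loss
loss.backward()

before = clip_by_global_norm(net.parameters(), theta=1.0)
after = torch.sqrt(sum(torch.sum(p.grad ** 2) for p in net.parameters()))
print(f'norm before clipping {before.item():.1f}, after {after.item():.4f}')
```

Since every gradient is multiplied by the same scalar, the direction of the concatenated
gradient vector is unchanged and only its length shrinks. PyTorch also ships a built-in
utility, torch.nn.utils.clip_grad_norm_, that serves the same purpose.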


9.5.4 Training 


Using The Time Machine dataset (data), we train a character-level language model (model)
based on the RNN (rnn) implemented from scratch. Note that we first calculate the
gradients, then clip them, and finally update the model parameters using the clipped
gradients.


data = d2l.TimeMachine(batch_size=1024, num_steps=32)
rnn = RNNScratch(num_inputs=len(data.vocab), num_hiddens=32)
model = RNNLMScratch(rnn, vocab_size=len(data.vocab), lr=1)
trainer = d2l.Trainer(max_epochs=100, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)


[Figure: training and validation perplexity curves (train_ppl, val_ppl).]


9.5.5 Decoding 


Once a language model has been learned, we can use it not only to predict the next token 
but to continue predicting each subsequent one, treating the previously predicted token as 
though it were the next in the input. Sometimes we will just want to generate text as though 
we were starting at the beginning of a document. However, it is often useful to condition the
language model on a user-supplied prefix. For example, if we were developing an autocom- 
plete feature for a search engine or to assist users in writing emails, we would want to feed 
in what they had written so far (the prefix), and then generate a likely continuation. 


The following predict method generates a continuation, one character at a time, after 
ingesting a user-provided prefix. When looping through the characters in prefix, we 
keep passing the hidden state to the next time step but do not generate any output. This is 
called the warm-up period. After ingesting the prefix, we are now ready to begin emitting 
the subsequent characters, each of which will be fed back into the model as the input at the 
next time step. 


@d2l.add_to_class(RNNLMScratch)  #@save
def predict(self, prefix, num_preds, vocab, device=None):
    state, outputs = None, [vocab[prefix[0]]]
    for i in range(len(prefix) + num_preds - 1):
        X = torch.tensor([[outputs[-1]]], device=device)
        embs = self.one_hot(X)
        rnn_outputs, state = self.rnn(embs, state)
        if i < len(prefix) - 1:  # Warm-up period
            outputs.append(vocab[prefix[i + 1]])
        else:  # Predict num_preds steps
            Y = self.output_layer(rnn_outputs)
            outputs.append(int(Y.argmax(axis=2).reshape(1)))
    return ''.join([vocab.idx_to_token[i] for i in outputs])


In the following, we specify the prefix and have the model generate 20 additional characters.


model.predict('it has', 20, data.vocab, d2l.try_gpu())


'it has in the the the the '


While implementing the above RNN model from scratch is instructive, it is not convenient. 
In the next section, we will see how to leverage deep learning frameworks to whip up RNNs 
using standard architectures, and to reap performance gains by relying on highly optimized 
library functions. 


9.5.6 Summary 


We can train RNN-based language models to generate text following the user-provided text 
prefix. A simple RNN language model consists of input encoding, RNN modeling, and 
output generation. During training, gradient clipping can mitigate the problem of explod- 
ing gradients but does not address the problem of vanishing gradients. In the experiment, 
we implemented a simple RNN language model and trained it with gradient clipping on se- 
quences of text, tokenized at the character level. By conditioning on a prefix, we can use a 
language model to generate likely continuations, which proves useful in many applications, 
e.g., autocomplete features. 


9.5.7 Exercises 


1. Does the implemented language model predict the next token based on all the past tokens 
up to the very first token in The Time Machine? 


2. Which hyperparameter controls the length of history used for prediction? 


3. Show that one-hot encoding is equivalent to picking a different embedding for each 
object. 


4. Adjust the hyperparameters (e.g., number of epochs, number of hidden units, number 
of time steps in a minibatch, and learning rate) to improve the perplexity. How low can 
you go while sticking with this simple architecture? 


5. Replace one-hot encoding with learnable embeddings. Does this lead to better perfor- 
mance? 


6. Conduct an experiment to determine how well this language model trained on The Time 
Machine works on other books by H. G. Wells, e.g., The War of the Worlds. 


7. Conduct another experiment to evaluate the perplexity of this model on books written 
by other authors. 


8. Modify the prediction method so as to use sampling rather than picking the most likely 
next character. 


• What happens?
• Bias the model towards more likely outputs, e.g., by sampling from
  $q(x_t \mid x_{t-1}, \ldots, x_1) \propto P(x_t \mid x_{t-1}, \ldots, x_1)^\alpha$ for $\alpha > 1$.


9. Run the code in this section without clipping the gradient. What happens? 


10. Replace the activation function used in this section with ReLU and repeat the experi- 
ments in this section. Do we still need gradient clipping? Why? 


Discussions 142.


9.6 Concise Implementation of Recurrent Neural 
Networks 


Like most of our from-scratch implementations, Section 9.5 was designed to provide in- 
sight into how each component works. But when you are using RNNs every day or writing 
production code, you will want to rely more on libraries that cut down on both implemen- 
tation time (by supplying library code for common models and functions) and computation 
time (by optimizing the heck out of these library implementations). This section will show 
you how to implement the same language model more efficiently using the high-level API 
provided by your deep learning framework. We begin, as before, by loading The Time 
Machine dataset. 


import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


9.6.1 Defining the Model 
We define the following class using the RNN implemented by high-level APIs. 


class RNN(d2l.Module):  #@save
    """The RNN model implemented with high-level APIs."""
    def __init__(self, num_inputs, num_hiddens):
        super().__init__()
        self.save_hyperparameters()
        self.rnn = nn.RNN(num_inputs, num_hiddens)

    def forward(self, inputs, H=None):
        return self.rnn(inputs, H)


Inheriting from the RNNLMScratch class in Section 9.5, the following RNNLM class defines
a complete RNN-based language model. Note that we need to create a separate fully
connected output layer.


class RNNLM(d2l.RNNLMScratch):  #@save
    """The RNN-based language model implemented with high-level APIs."""
    def init_params(self):
        self.linear = nn.LazyLinear(self.vocab_size)

    def output_layer(self, hiddens):
        return self.linear(hiddens).swapaxes(0, 1)
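
To see what the built-in layer returns, the following standalone sketch calls nn.RNN
directly on random data and compares its output with the tanh recurrence written out by
hand. The sizes (including the vocabulary size of 28) are illustrative, and the sketch
assumes the default configuration of nn.RNN: a single unidirectional layer with a tanh
nonlinearity and inputs laid out as (num_steps, batch_size, num_inputs).

```python
import torch
from torch import nn

torch.manual_seed(0)
num_steps, batch_size, num_inputs, num_hiddens = 5, 2, 16, 32  # illustrative
rnn = nn.RNN(num_inputs, num_hiddens)
X = torch.randn(num_steps, batch_size, num_inputs)
outputs, H = rnn(X)
print(outputs.shape)  # torch.Size([5, 2, 32]): hidden state at every time step
print(H.shape)        # torch.Size([1, 2, 32]): final hidden state of the layer

# Recompute the same recurrence by hand from the layer's parameters
state = torch.zeros(batch_size, num_hiddens)
manual = []
with torch.no_grad():
    for X_t in X:
        state = torch.tanh(X_t @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
                           + state @ rnn.weight_hh_l0.T + rnn.bias_hh_l0)
        manual.append(state)
manual = torch.stack(manual)
print(torch.allclose(manual, outputs, atol=1e-5))  # True

# A linear output layer followed by swapaxes puts the batch dimension first
linear = nn.Linear(num_hiddens, 28)  # 28 is an illustrative vocabulary size
logits = linear(outputs).swapaxes(0, 1)
print(logits.shape)  # torch.Size([2, 5, 28])
```

Unlike the from-scratch version, nn.RNN returns the hidden states of all time steps as one
stacked tensor rather than a Python list, which is why the output layer can be applied in
a single call before swapping the first two axes.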


9.6.2 Training and Predicting 


Before training the model, let’s make a prediction with a model initialized with random 
weights. Given that we have not trained the network, it will generate nonsensical predic- 
tions. 


data = d2l.TimeMachine(batch_size=1024, num_steps=32)
rnn = RNN(num_inputs=len(data.vocab), num_hiddens=32)
model = RNNLM(rnn, vocab_size=len(data.vocab), lr=1)
model.predict('it has', 20, data.vocab)


'it hasoadd dd dd dd dd dd '


Next, we train our model, leveraging the high-level API. 


trainer = d2l.Trainer(max_epochs=100, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)


[Figure: training and validation perplexity curves (train_ppl, val_ppl).]


Compared with Section 9.5, this model achieves comparable perplexity, but runs faster due 
to the optimized implementations. As before, we can generate predicted tokens following 
the specified prefix string. 


model.predict('it has', 20, data.vocab, d2l.try_gpu())


'it has and the trave the t'


9.6.3 Summary


High-level APIs in deep learning frameworks provide implementations of standard RNNs.
These libraries help you to avoid wasting time reimplementing standard models. Moreover, 
framework implementations are often highly optimized, leading to significant (computa- 
tional) performance gains when compared with implementations from scratch. 


9.6.4 Exercises 
1. Can you make the RNN model overfit using the high-level APIs? 


2. Implement the autoregressive model of Section 9.1 using an RNN. 


Discussions 143.


9.7 Backpropagation Through Time 


If you completed the exercises in Section 9.5, you would have seen that gradient clipping is 
vital for preventing the occasional massive gradients from destabilizing training. We hinted 
that the exploding gradients stem from backpropagating across long sequences. Before in- 
troducing a slew of modern RNN architectures, let’s take a closer look at how backprop- 
agation works in sequence models in mathematical detail. Hopefully, this discussion will 
bring some precision to the notion of vanishing and exploding gradients. If you recall our 
discussion of forward and backward propagation through computational graphs when we 
introduced MLPs in Section 5.3, then forward propagation in RNNs should be relatively 
straightforward. Applying backpropagation in RNNs is called backpropagation through 
time (Werbos, 1990). This procedure requires us to expand (or unroll) the computational 
graph of an RNN one time step at a time. The unrolled RNN is essentially a feedforward 
neural network with the special property that the same parameters are repeated throughout 
the unrolled network, appearing at each time step. Then, just as in any feedforward neural 
network, we can apply the chain rule, backpropagating gradients through the unrolled net. 
The gradient with respect to each parameter must be summed across all places that the pa- 
rameter occurs in the unrolled net. Handling such weight tying should be familiar from our 
chapters on convolutional neural networks. 


Complications arise because sequences can be rather long. It is not unusual to work with 
text sequences consisting of over a thousand tokens. Note that this poses problems both 
from a computational (too much memory) and optimization (numerical instability) stand- 
point. Input from the first step passes through over 1000 matrix products before arriving 
at the output, and another 1000 matrix products are required to compute the gradient. We 
now analyze what can go wrong and how to address it in practice. 


9.7.1 Analysis of Gradients in RNNs 


We start with a simplified model of how an RNN works. This model ignores details about
the specifics of the hidden state and how it is updated. The mathematical notation here does
not explicitly distinguish scalars, vectors, and matrices. We are just trying to develop some
intuition. In this simplified model, we denote $h_t$ as the hidden state, $x_t$ as input, and
$o_t$ as output at time step $t$. Recall our discussions in Section 9.4.2 that the input and
the hidden state can be concatenated before being multiplied by one weight variable in the
hidden layer. Thus, we use $w_\mathrm{h}$ and $w_\mathrm{o}$ to indicate the weights of the
hidden layer and the output layer, respectively. As a result, the hidden states and outputs
at each time step are


$$\begin{aligned} h_t &= f(x_t, h_{t-1}, w_\mathrm{h}), \\ o_t &= g(h_t, w_\mathrm{o}), \end{aligned} \tag{9.7.1}$$


where $f$ and $g$ are transformations of the hidden layer and the output layer, respectively.
Hence, we have a chain of values
$\{\ldots, (x_{t-1}, h_{t-1}, o_{t-1}), (x_t, h_t, o_t), \ldots\}$ that depend on
each other via recurrent computation. The forward propagation is fairly straightforward.
All we need is to loop through the $(x_t, h_t, o_t)$ triples one time step at a time. The
discrepancy between output $o_t$ and the desired target $y_t$ is then evaluated by an
objective function across all the $T$ time steps as


$$L(x_1, \ldots, x_T, y_1, \ldots, y_T, w_\mathrm{h}, w_\mathrm{o}) = \frac{1}{T} \sum_{t=1}^T l(y_t, o_t). \tag{9.7.2}$$


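For backpropagation, matters are a bit trickier, especially when we compute the gradients
with regard to the parameters $w_\mathrm{h}$ of the objective function $L$. To be specific,
by the chain rule,


$$\begin{aligned} \frac{\partial L}{\partial w_\mathrm{h}} &= \frac{1}{T} \sum_{t=1}^T \frac{\partial l(y_t, o_t)}{\partial w_\mathrm{h}} \\ &= \frac{1}{T} \sum_{t=1}^T \frac{\partial l(y_t, o_t)}{\partial o_t} \frac{\partial g(h_t, w_\mathrm{o})}{\partial h_t} \frac{\partial h_t}{\partial w_\mathrm{h}}. \end{aligned} \tag{9.7.3}$$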


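The first and the second factors of the product in (9.7.3) are easy to compute. The third
factor $\partial h_t / \partial w_\mathrm{h}$ is where things get tricky, since we need to
recurrently compute the effect of the parameter $w_\mathrm{h}$ on $h_t$. According to the
recurrent computation in (9.7.1), $h_t$ depends on both $h_{t-1}$ and $w_\mathrm{h}$, where
computation of $h_{t-1}$ also depends on $w_\mathrm{h}$. Thus, evaluating the total
derivative of $h_t$ with respect to $w_\mathrm{h}$ using the chain rule yields


$$\frac{\partial h_t}{\partial w_\mathrm{h}} = \frac{\partial f(x_t, h_{t-1}, w_\mathrm{h})}{\partial w_\mathrm{h}} + \frac{\partial f(x_t, h_{t-1}, w_\mathrm{h})}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial w_\mathrm{h}}. \tag{9.7.4}$$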


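To derive the above gradient, assume that we have three sequences $\{a_t\}, \{b_t\}, \{c_t\}$
satisfying $a_0 = 0$ and $a_t = b_t + c_t a_{t-1}$ for $t = 1, 2, \ldots$. Then for
$t \geq 1$, it is easy to show


$$a_t = b_t + \sum_{i=1}^{t-1} \left( \prod_{j=i+1}^{t} c_j \right) b_i. \tag{9.7.5}$$


By substituting $a_t$, $b_t$, and $c_t$ according to


$$\begin{aligned} a_t &= \frac{\partial h_t}{\partial w_\mathrm{h}}, \\ b_t &= \frac{\partial f(x_t, h_{t-1}, w_\mathrm{h})}{\partial w_\mathrm{h}}, \\ c_t &= \frac{\partial f(x_t, h_{t-1}, w_\mathrm{h})}{\partial h_{t-1}}, \end{aligned} \tag{9.7.6}$$


the gradient computation in (9.7.4) satisfies $a_t = b_t + c_t a_{t-1}$. Thus, per (9.7.5),
we can remove the recurrent computation in (9.7.4) with


$$\frac{\partial h_t}{\partial w_\mathrm{h}} = \frac{\partial f(x_t, h_{t-1}, w_\mathrm{h})}{\partial w_\mathrm{h}} + \sum_{i=1}^{t-1} \left( \prod_{j=i+1}^{t} \frac{\partial f(x_j, h_{j-1}, w_\mathrm{h})}{\partial h_{j-1}} \right) \frac{\partial f(x_i, h_{i-1}, w_\mathrm{h})}{\partial w_\mathrm{h}}. \tag{9.7.7}$$


While we can use the chain rule to compute $\partial h_t / \partial w_\mathrm{h}$
recursively, this chain can get very long whenever $t$ is large. Let’s discuss a number of
strategies for dealing with this problem.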
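
Before turning to these strategies, note that the identity (9.7.5), on which (9.7.7) rests,
is easy to check numerically. The following sketch uses arbitrary random scalar sequences
(not quantities taken from an actual RNN) and compares the recursion
$a_t = b_t + c_t a_{t-1}$ with the closed form. The helper name closed_form is hypothetical.

```python
import math
import random

random.seed(0)
T = 6  # illustrative number of time steps
# Index 0 is a placeholder so that b[t] and c[t] match the 1-based notation
b = [None] + [random.uniform(-1, 1) for _ in range(T)]
c = [None] + [random.uniform(-1, 1) for _ in range(T)]

# Recursive definition: a_0 = 0 and a_t = b_t + c_t * a_{t-1}
a = [0.0]
for t in range(1, T + 1):
    a.append(b[t] + c[t] * a[t - 1])

def closed_form(t):
    """Right-hand side of (9.7.5)."""
    return b[t] + sum(
        math.prod(c[j] for j in range(i + 1, t + 1)) * b[i]
        for i in range(1, t))

max_gap = max(abs(a[t] - closed_form(t)) for t in range(1, T + 1))
print(max_gap < 1e-12)  # True
```

Note that the term for index $i$ carries a product of $t - i$ factors $c_j$, which is
exactly the part of (9.7.7) that can shrink or grow rapidly as $t$ increases.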


Full Computation 


One idea might be to compute the full sum in (9.7.7). However, this is very slow and 
gradients can blow up, since subtle changes in the initial conditions can potentially affect 
the outcome a lot. That is, we could see things similar to the butterfly effect, where minimal 
changes in the initial conditions lead to disproportionate changes in the outcome. This is 
generally undesirable. After all, we are looking for robust estimators that generalize well. 
Hence this strategy is almost never used in practice. 


Truncating Time Steps 


Alternatively, we can truncate the sum in (9.7.7) after $\tau$ steps. This is what we have
been discussing so far. This leads to an approximation of the true gradient, simply by
terminating the sum at $\partial h_{t-\tau} / \partial w_\mathrm{h}$. In practice this works
quite well. It is what is commonly referred to as truncated backpropagation through time
(Jaeger, 2002). One of the consequences of this is that the model focuses primarily on
short-term influence rather than long-term consequences. This is actually desirable, since
it biases the estimate towards simpler and more stable models.
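
One common way to realize such a truncation in PyTorch is to detach the hidden state from
the computational graph at regular intervals. The sketch below is only an illustration
under stated assumptions: the tiny tanh RNN, its random weights, the function name
last_state, and the truncation length are all made up. It shows that, after detaching,
the final hidden state no longer receives any gradient from inputs that lie before the
most recent cut.

```python
import torch

torch.manual_seed(0)
num_steps, num_hiddens, trunc = 8, 4, 3  # illustrative values
W_xh = torch.randn(1, num_hiddens) * 0.5
W_hh = torch.randn(num_hiddens, num_hiddens) * 0.5
xs = torch.randn(num_steps, 1, 1, requires_grad=True)

def last_state(xs, truncate_every=None):
    """Run a tanh RNN and return its final hidden state."""
    h = torch.zeros(1, num_hiddens)
    for t, x in enumerate(xs):
        if truncate_every is not None and t > 0 and t % truncate_every == 0:
            h = h.detach()  # cut the graph: no gradient flows further back
        h = torch.tanh(x @ W_xh + h @ W_hh)
    return h

full_grad, = torch.autograd.grad(last_state(xs).sum(), xs)
trunc_grad, = torch.autograd.grad(last_state(xs, trunc).sum(), xs)

# With cuts before steps 3 and 6, only the inputs at steps 6 and 7 matter
print(full_grad.flatten().abs() > 0)
print(trunc_grad.flatten().abs() > 0)
```

The forward computation is identical in both cases. Only the backward pass differs, which
is why truncation yields an approximation of the gradient rather than a different model.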


Randomized Truncation 


Last, we can replace $\partial h_t / \partial w_\mathrm{h}$ by a random variable which is
correct in expectation but truncates the sequence. This is achieved by using a sequence of
$\xi_t$ with predefined $0 \leq \pi_t \leq 1$, where $P(\xi_t = 0) = 1 - \pi_t$ and
$P(\xi_t = \pi_t^{-1}) = \pi_t$, thus $E[\xi_t] = 1$. We use this to replace the gradient
$\partial h_t / \partial w_\mathrm{h}$ in (9.7.4) with


$$z_t = \frac{\partial f(x_t, h_{t-1}, w_\mathrm{h})}{\partial w_\mathrm{h}} + \xi_t \frac{\partial f(x_t, h_{t-1}, w_\mathrm{h})}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial w_\mathrm{h}}. \tag{9.7.8}$$


It follows from the definition of $\xi_t$ that $E[z_t] = \partial h_t / \partial w_\mathrm{h}$.
Whenever $\xi_t = 0$ the recurrent computation terminates at that time step $t$. This leads
to a weighted sum of sequences of varying lengths, where long sequences are rare but
appropriately overweighted. This idea was proposed by Tallec and Ollivier (2017).
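
The two defining properties of $\xi_t$ can be illustrated with a quick simulation. This is
a minimal sketch with an arbitrarily chosen value $\pi_t = 0.25$ and an arbitrary sample
size. It only samples the random factor itself and does not implement the full estimator.

```python
import torch

torch.manual_seed(0)
pi = 0.25    # illustrative probability of continuing the recursion
n = 200_000  # number of Monte Carlo samples

# xi equals 1 / pi with probability pi and 0 otherwise
xi = torch.bernoulli(torch.full((n,), pi)) / pi
mean, var = xi.mean().item(), xi.var().item()
print(f'sample mean {mean:.3f}, sample variance {var:.3f}')
```

The sample mean is close to 1, in line with $E[\xi_t] = 1$. The variance, however, is
$1/\pi_t - 1$ (here 3), so making truncation more likely by lowering $\pi_t$ increases
the noise of the resulting gradient estimate.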


Comparing Strategies 


Fig. 9.7.1: Comparing strategies for computing gradients in RNNs, illustrated on the text
“the time machine by h g wells”. From top to bottom: randomized truncation, regular
truncation, and full computation.


Fig. 9.7.1 illustrates the three strategies when analyzing the first few characters of The Time 
Machine using backpropagation through time for RNNs: 


• The first row is the randomized truncation that partitions the text into segments of varying
  lengths.


• The second row is the regular truncation that breaks the text into subsequences of the
  same length. This is what we have been doing in RNN experiments.


• The third row is the full backpropagation through time that leads to a computationally
  infeasible expression.


Unfortunately, while appealing in theory, randomized truncation does not work much bet- 
ter than regular truncation, most likely due to a number of factors. First, the effect of an 
observation after a number of backpropagation steps into the past is quite sufficient to cap- 
ture dependencies in practice. Second, the increased variance counteracts the fact that the 
gradient is more accurate with more steps. Third, we actually want models that have only 
a short range of interactions. Hence, regularly truncated backpropagation through time has 
a slight regularizing effect that can be desirable. 


9.7.2 Backpropagation Through Time in Detail 


After discussing the general principle, let’s discuss backpropagation through time in detail. 
In contrast to the analysis in Section 9.7.1, in the following we will show how to compute the 
gradients of the objective function with respect to all the decomposed model parameters. 
To keep things simple, we consider an RNN without bias parameters, whose activation
function in the hidden layer uses the identity mapping ($\phi(x) = x$). For time step $t$,
let the single example input and the target be $\mathbf{x}_t \in \mathbb{R}^d$ and $y_t$,
respectively. The hidden state $\mathbf{h}_t \in \mathbb{R}^h$ and the output
$\mathbf{o}_t \in \mathbb{R}^q$ are computed as


$$\begin{aligned} \mathbf{h}_t &= \mathbf{W}_\mathrm{hx} \mathbf{x}_t + \mathbf{W}_\mathrm{hh} \mathbf{h}_{t-1}, \\ \mathbf{o}_t &= \mathbf{W}_\mathrm{qh} \mathbf{h}_t, \end{aligned} \tag{9.7.9}$$


where $\mathbf{W}_\mathrm{hx} \in \mathbb{R}^{h \times d}$,
$\mathbf{W}_\mathrm{hh} \in \mathbb{R}^{h \times h}$, and
$\mathbf{W}_\mathrm{qh} \in \mathbb{R}^{q \times h}$ are the weight parameters. Denote
by $l(\mathbf{o}_t, y_t)$ the loss at time step $t$. Our objective function, the loss over
$T$ time steps from the beginning of the sequence, is thus


$$L = \frac{1}{T} \sum_{t=1}^T l(\mathbf{o}_t, y_t). \tag{9.7.10}$$


In order to visualize the dependencies among model variables and parameters during
computation of the RNN, we can draw a computational graph for the model, as shown in Fig.
9.7.2. For example, the computation of the hidden states of time step 3, $\mathbf{h}_3$,
depends on the model parameters $\mathbf{W}_\mathrm{hx}$ and $\mathbf{W}_\mathrm{hh}$, the
hidden state of the previous time step $\mathbf{h}_2$, and the input of the current time
step $\mathbf{x}_3$.


Fig. 9.7.2: Computational graph showing dependencies for an RNN model with three time steps.
Boxes represent variables (not shaded) or parameters (shaded) and circles represent
operators.


As just mentioned, the model parameters in Fig. 9.7.2 are $\mathbf{W}_\mathrm{hx}$,
$\mathbf{W}_\mathrm{hh}$, and $\mathbf{W}_\mathrm{qh}$. Generally, training this model
requires gradient computation with respect to these parameters
$\partial L / \partial \mathbf{W}_\mathrm{hx}$, $\partial L / \partial \mathbf{W}_\mathrm{hh}$,
and $\partial L / \partial \mathbf{W}_\mathrm{qh}$. According to the dependencies in Fig.
9.7.2, we can traverse in the opposite direction of the arrows to calculate and store the
gradients in turn. To flexibly express the multiplication of matrices, vectors, and scalars
of different shapes in the chain rule, we continue to use the prod operator as described in
Section 5.3.


First of all, differentiating the objective function with respect to the model output at any 
time step t is fairly straightforward: 


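$$\frac{\partial L}{\partial \mathbf{o}_t} = \frac{\partial l(\mathbf{o}_t, y_t)}{T \cdot \partial \mathbf{o}_t} \in \mathbb{R}^q. \tag{9.7.11}$$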


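Now we can calculate the gradient of the objective with respect to the parameter
$\mathbf{W}_\mathrm{qh}$ in the output layer:
$\partial L / \partial \mathbf{W}_\mathrm{qh} \in \mathbb{R}^{q \times h}$. Based on Fig.
9.7.2, the objective $L$ depends on $\mathbf{W}_\mathrm{qh}$ via
$\mathbf{o}_1, \ldots, \mathbf{o}_T$. Using the chain rule yields


$$\frac{\partial L}{\partial \mathbf{W}_\mathrm{qh}} = \sum_{t=1}^T \mathrm{prod}\left( \frac{\partial L}{\partial \mathbf{o}_t}, \frac{\partial \mathbf{o}_t}{\partial \mathbf{W}_\mathrm{qh}} \right) = \sum_{t=1}^T \frac{\partial L}{\partial \mathbf{o}_t} \mathbf{h}_t^\top, \tag{9.7.12}$$


where $\partial L / \partial \mathbf{o}_t$ is given by (9.7.11).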


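Next, as shown in Fig. 9.7.2, at the final time step $T$, the objective function $L$ depends
on the hidden state $\mathbf{h}_T$ only via $\mathbf{o}_T$. Therefore, we can easily find the
gradient $\partial L / \partial \mathbf{h}_T \in \mathbb{R}^h$ using the chain rule:


$$\frac{\partial L}{\partial \mathbf{h}_T} = \mathrm{prod}\left( \frac{\partial L}{\partial \mathbf{o}_T}, \frac{\partial \mathbf{o}_T}{\partial \mathbf{h}_T} \right) = \mathbf{W}_\mathrm{qh}^\top \frac{\partial L}{\partial \mathbf{o}_T}. \tag{9.7.13}$$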

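It gets trickier for any time step $t < T$, where the objective function $L$ depends on
$\mathbf{h}_t$ via $\mathbf{h}_{t+1}$ and $\mathbf{o}_t$. According to the chain rule, the
gradient of the hidden state $\partial L / \partial \mathbf{h}_t \in \mathbb{R}^h$ at
any time step $t < T$ can be recurrently computed as:


$$\frac{\partial L}{\partial \mathbf{h}_t} = \mathrm{prod}\left( \frac{\partial L}{\partial \mathbf{h}_{t+1}}, \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_t} \right) + \mathrm{prod}\left( \frac{\partial L}{\partial \mathbf{o}_t}, \frac{\partial \mathbf{o}_t}{\partial \mathbf{h}_t} \right) = \mathbf{W}_\mathrm{hh}^\top \frac{\partial L}{\partial \mathbf{h}_{t+1}} + \mathbf{W}_\mathrm{qh}^\top \frac{\partial L}{\partial \mathbf{o}_t}. \tag{9.7.14}$$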


For analysis, expanding the recurrent computation for any time step $1 \leq t \leq T$ gives


$$\frac{\partial L}{\partial \mathbf{h}_t} = \sum_{i=t}^T {\left( \mathbf{W}_\mathrm{hh}^\top \right)}^{T-i} \mathbf{W}_\mathrm{qh}^\top \frac{\partial L}{\partial \mathbf{o}_{T+t-i}}. \tag{9.7.15}$$


We can see from (9.7.15) that this simple linear example already exhibits some key problems
of long sequence models: it involves potentially very large powers of
$\mathbf{W}_\mathrm{hh}^\top$. In it, eigenvalues smaller than 1 vanish and eigenvalues
larger than 1 diverge. This is numerically unstable, which manifests itself in the form of
vanishing and exploding gradients. One way to address this is to truncate the time steps at
a computationally convenient size as discussed in Section 9.7.1. In practice, this
truncation can also be effected by detaching the gradient after a given number of time
steps. Later on, we will see how more sophisticated sequence models such as long short-term
memory can alleviate this further.
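
The effect of repeated multiplication can be seen numerically. The sketch below is an
illustration under simplifying assumptions: it uses a random symmetric matrix (so that all
eigenvalues are real and the spectral norm of the $k$-th power equals the $k$-th power of
the largest eigenvalue magnitude), rescaled so that this magnitude is either 0.9 or 1.1.
The matrix size and the chosen powers are arbitrary.

```python
import torch

torch.manual_seed(0)
A = torch.randn(4, 4, dtype=torch.float64)
S = (A + A.T) / 2  # symmetric matrix with real eigenvalues
rho = torch.linalg.eigvalsh(S).abs().max()  # largest eigenvalue magnitude

norms = {}
for target in (0.9, 1.1):
    W = S * (target / rho)  # rescale so the largest |eigenvalue| is `target`
    for k in (1, 10, 100):
        W_k = torch.linalg.matrix_power(W, k)
        norms[(target, k)] = torch.linalg.matrix_norm(W_k, ord=2).item()
        print(f'largest |eigenvalue| {target}, power {k:3d}: '
              f'norm {norms[(target, k)]:.3e}')
```

With a largest magnitude of 0.9, the norm shrinks to about $0.9^{100} \approx 2.7 \times
10^{-5}$ after 100 multiplications, whereas with 1.1 it grows to about $1.1^{100} \approx
1.4 \times 10^{4}$. A seemingly small deviation from 1 is thus enough to make gradients
vanish or explode over long sequences.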


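Finally, Fig. 9.7.2 shows that the objective function $L$ depends on model parameters
$\mathbf{W}_\mathrm{hx}$ and $\mathbf{W}_\mathrm{hh}$ in the hidden layer via hidden states
$\mathbf{h}_1, \ldots, \mathbf{h}_T$. To compute gradients with respect to such parameters
$\partial L / \partial \mathbf{W}_\mathrm{hx} \in \mathbb{R}^{h \times d}$ and
$\partial L / \partial \mathbf{W}_\mathrm{hh} \in \mathbb{R}^{h \times h}$, we apply the
chain rule giving


$$\begin{aligned} \frac{\partial L}{\partial \mathbf{W}_\mathrm{hx}} &= \sum_{t=1}^T \mathrm{prod}\left( \frac{\partial L}{\partial \mathbf{h}_t}, \frac{\partial \mathbf{h}_t}{\partial \mathbf{W}_\mathrm{hx}} \right) = \sum_{t=1}^T \frac{\partial L}{\partial \mathbf{h}_t} \mathbf{x}_t^\top, \\ \frac{\partial L}{\partial \mathbf{W}_\mathrm{hh}} &= \sum_{t=1}^T \mathrm{prod}\left( \frac{\partial L}{\partial \mathbf{h}_t}, \frac{\partial \mathbf{h}_t}{\partial \mathbf{W}_\mathrm{hh}} \right) = \sum_{t=1}^T \frac{\partial L}{\partial \mathbf{h}_t} \mathbf{h}_{t-1}^\top, \end{aligned} \tag{9.7.16}$$


where $\partial L / \partial \mathbf{h}_t$, which is recurrently computed by (9.7.13) and
(9.7.14), is the key quantity that affects the numerical stability.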


Since backpropagation through time is the application of backpropagation in RNNs, as we
have explained in Section 5.3, training RNNs alternates forward propagation with
backpropagation through time. Moreover, backpropagation through time computes and stores
the above gradients in turn. Specifically, stored intermediate values are reused to avoid
duplicate calculations, such as storing $\partial L / \partial \mathbf{h}_t$ to be used in
computation of both $\partial L / \partial \mathbf{W}_\mathrm{hx}$ and
$\partial L / \partial \mathbf{W}_\mathrm{hh}$.
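
The formulas of this section can be checked against automatic differentiation. The sketch
below is a minimal illustration, not a reference implementation. It assumes a squared-error
loss $l(\mathbf{o}_t, \mathbf{y}_t) = \frac{1}{2} \|\mathbf{o}_t - \mathbf{y}_t\|^2$
(the text leaves the loss unspecified), a zero initial hidden state, and small arbitrary
dimensions. It runs the linear RNN of (9.7.9), performs backpropagation through time by
hand using (9.7.11) to (9.7.16), and compares the result with the gradients computed by
PyTorch.

```python
import torch

torch.manual_seed(0)
T, d, h, q = 5, 3, 4, 2  # illustrative sizes
kw = dict(dtype=torch.float64)
W_hx = torch.randn(h, d, requires_grad=True, **kw)
W_hh = torch.randn(h, h, requires_grad=True, **kw)
W_qh = torch.randn(q, h, requires_grad=True, **kw)
xs, ys = torch.randn(T, d, **kw), torch.randn(T, q, **kw)

# Forward pass (9.7.9), keeping every hidden state and output
hs = [torch.zeros(h, **kw)]  # hs[0] is the initial state h_0
outs = []
for t in range(T):
    hs.append(W_hx @ xs[t] + W_hh @ hs[-1])
    outs.append(W_qh @ hs[-1])
# Assumed squared-error loss, averaged over time steps as in (9.7.10)
loss = sum(0.5 * ((o - y) ** 2).sum() for o, y in zip(outs, ys)) / T
loss.backward()

# Manual backpropagation through time
with torch.no_grad():
    dL_do = [(o - y) / T for o, y in zip(outs, ys)]           # (9.7.11)
    dW_qh = sum(torch.outer(g, hs[t + 1])
                for t, g in enumerate(dL_do))                 # (9.7.12)
    dW_hx, dW_hh = torch.zeros_like(W_hx), torch.zeros_like(W_hh)
    dL_dh = torch.zeros(h, **kw)  # zero beyond the final time step
    for t in reversed(range(T)):
        # (9.7.13) at the last step, (9.7.14) at all earlier steps
        dL_dh = W_hh.T @ dL_dh + W_qh.T @ dL_do[t]
        dW_hx += torch.outer(dL_dh, xs[t])                    # (9.7.16)
        dW_hh += torch.outer(dL_dh, hs[t])  # hs[t] is the previous state

print(torch.allclose(dW_qh, W_qh.grad),
      torch.allclose(dW_hx, W_hx.grad),
      torch.allclose(dW_hh, W_hh.grad))  # True True True
```

The single vector dL_dh is updated once per time step and then reused for both weight
gradients, which mirrors the caching of intermediate values described above.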


9.7.3 Summary


Backpropagation through time is merely an application of backpropagation to sequence
models with a hidden state. Truncation, such as regular or randomized, is needed for com- 
putational convenience and numerical stability. High powers of matrices can lead to diver- 
gent or vanishing eigenvalues. This manifests itself in the form of exploding or vanishing 
gradients. For efficient computation, intermediate values are cached during backpropaga- 
tion through time. 


9.7.4 Exercises 


1. Assume that we have a symmetric matrix $\mathbf{M} \in \mathbb{R}^{n \times n}$ with
eigenvalues $\lambda_i$ whose corresponding eigenvectors are $\mathbf{v}_i$
($i = 1, \ldots, n$). Without loss of generality, assume that they are ordered in the order
$|\lambda_i| \geq |\lambda_{i+1}|$.


1. Show that $\mathbf{M}^k$ has eigenvalues $\lambda_i^k$.


2. Prove that for a random vector $\mathbf{x} \in \mathbb{R}^n$, with high probability
$\mathbf{M}^k \mathbf{x}$ will be very much aligned with the eigenvector $\mathbf{v}_1$ of
$\mathbf{M}$. Formalize this statement.


3. What does the above result mean for gradients in RNNs? 


2. Besides gradient clipping, can you think of any other methods to cope with gradient 
explosion in recurrent neural networks? 


10 Modern Recurrent Neural Networks


The previous chapter introduced the key ideas behind recurrent neural networks (RNNs).
However, just as with convolutional neural networks, there has been a tremendous amount 
of innovation in RNN architectures, culminating in several complex designs that have 
proven successful in practice. In particular, the most popular designs feature mechanisms 
for mitigating the notorious numerical instability faced by RNNs, as typified by vanishing 
and exploding gradients. Recall that in Chapter 9 we dealt with exploding gradients by ap- 
plying a blunt gradient clipping heuristic. Despite the efficacy of this hack, it leaves open 
the problem of vanishing gradients. 


In this chapter, we introduce the key ideas behind the most successful RNN architectures for 
sequences, which stem from two papers. The first, Long Short-Term Memory (Hochreiter 
and Schmidhuber, 1997), introduces the memory cell, a unit of computation that replaces 
traditional nodes in the hidden layer of a network. With these memory cells, networks 
are able to overcome difficulties with training encountered by earlier recurrent networks. 
Intuitively, the memory cell avoids the vanishing gradient problem by keeping values in 
each memory cell’s internal state cascading along a recurrent edge with weight 1 across 
many successive time steps. A set of multiplicative gates helps the network to determine not
only which inputs to allow into the memory state, but also when the content of the memory
state should influence the model’s output.


The second paper, Bidirectional Recurrent Neural Networks (Schuster and Paliwal, 1997),
introduces an architecture in which information from both the future (subsequent time steps)
and the past (preceding time steps) is used to determine the output at any point in the
sequence. This is in contrast to previous networks, in which only past input can affect the
output. Bidirectional RNNs have become a mainstay for sequence labeling tasks in natural
language processing, among a myriad of other tasks. Fortunately, the two innovations
are not mutually exclusive, and have been successfully combined for phoneme classification
(Graves and Schmidhuber, 2005) and handwriting recognition (Graves et al., 2008).


The first sections in this chapter will explain the LSTM architecture, a lighter-weight version
called the gated recurrent unit (GRU), and the key ideas behind bidirectional RNNs, and will
briefly explain how RNN layers are stacked together to form deep RNNs. Subsequently,
we will explore the application of RNNs in sequence-to-sequence tasks, introducing
machine translation along with key ideas such as encoder-decoder architectures and beam
search.


10.1 Long Short-Term Memory (LSTM)


Shortly after the first Elman-style RNNs were trained using backpropagation (Elman, 1990), 
the problems of learning long-term dependencies (owing to vanishing and exploding gra- 
dients) became salient, with Bengio and Hochreiter discussing the problem (Bengio et al., 
1994, Hochreiter et al., 2001). Hochreiter had articulated this problem as early as 1991 in 
his Master’s thesis, although the results were not widely known because the thesis was writ- 
ten in German. While gradient clipping helps with exploding gradients, handling vanishing 
gradients appears to require a more elaborate solution. One of the first and most successful 
techniques for addressing vanishing gradients came in the form of the long short-term mem- 
ory (LSTM) model due to Hochreiter and Schmidhuber (1997). LSTMs resemble standard 
recurrent neural networks but here each ordinary recurrent node is replaced by a memory 
cell. Each memory cell contains an internal state, i.e., a node with a self-connected re- 
current edge of fixed weight 1, ensuring that the gradient can pass across many time steps 
without vanishing or exploding. 


The term “long short-term memory” comes from the following intuition. Simple recurrent 
neural networks have long-term memory in the form of weights. The weights change slowly 
during training, encoding general knowledge about the data. They also have short-term 
memory in the form of ephemeral activations, which pass from each node to successive 
nodes. The LSTM model introduces an intermediate type of storage via the memory cell. 
A memory cell is a composite unit, built from simpler nodes in a specific connectivity 
pattern, with the novel inclusion of multiplicative nodes. 


import torch
from torch import nn
from d2l import torch as d2l


10.1.1 Gated Memory Cell 


Each memory cell is equipped with an internal state and a number of multiplicative gates 
that determine whether (i) a given input should impact the internal state (the input gate), 
(ii) the internal state should be flushed to 0 (the forget gate), and (iii) the internal state of a 
given neuron should be allowed to impact the cell’s output (the output gate). 


Gated Hidden State 


The key distinction between vanilla RNNs and LSTMs is that the latter support gating of 
the hidden state. This means that we have dedicated mechanisms for when a hidden state 
should be updated and also for when it should be reset. These mechanisms are learned and 
they address the concerns listed above. For instance, if the first token is of great importance 
we will learn not to update the hidden state after the first observation. Likewise, we will 
learn to skip irrelevant temporary observations. Last, we will learn to reset the latent state 
whenever needed. We discuss this in detail below. 




Input Gate, Forget Gate, and Output Gate 


The data feeding into the LSTM gates are the input at the current time step and the hidden 
state of the previous time step, as illustrated in Fig. 10.1.1. Three fully connected layers 
with sigmoid activation functions compute the values of the input, forget, and output gates. 
As a result of the sigmoid activation, all values of the three gates are in the range of (0, 1). 
Additionally, we require an input node, typically computed with a tanh activation func- 
tion. Intuitively, the input gate determines how much of the input node’s value should be 
added to the current memory cell internal state. The forget gate determines whether to keep 
the current value of the memory or flush it. And the output gate determines whether the 
memory cell should influence the output at the current time step. 


Fig. 10.1.1: Computing the input gate, the forget gate, and the output gate in an LSTM model.


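Mathematically, suppose that there are $h$ hidden units, the batch size is $n$, and the number
of inputs is $d$. Thus, the input is $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ and the hidden state of the
previous time step is $\mathbf{H}_{t-1} \in \mathbb{R}^{n \times h}$. Correspondingly, the gates at time
step $t$ are defined as follows: the input gate is $\mathbf{I}_t \in \mathbb{R}^{n \times h}$, the forget gate
is $\mathbf{F}_t \in \mathbb{R}^{n \times h}$, and the output gate is $\mathbf{O}_t \in \mathbb{R}^{n \times h}$. They
are calculated as follows:


$$\begin{aligned}
\mathbf{I}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xi} + \mathbf{H}_{t-1} \mathbf{W}_{hi} + \mathbf{b}_i),\\
\mathbf{F}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xf} + \mathbf{H}_{t-1} \mathbf{W}_{hf} + \mathbf{b}_f),\\
\mathbf{O}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xo} + \mathbf{H}_{t-1} \mathbf{W}_{ho} + \mathbf{b}_o),
\end{aligned} \tag{10.1.1}$$


where $\mathbf{W}_{xi}, \mathbf{W}_{xf}, \mathbf{W}_{xo} \in \mathbb{R}^{d \times h}$ and
$\mathbf{W}_{hi}, \mathbf{W}_{hf}, \mathbf{W}_{ho} \in \mathbb{R}^{h \times h}$ are weight parameters and
$\mathbf{b}_i, \mathbf{b}_f, \mathbf{b}_o \in \mathbb{R}^{1 \times h}$ are bias parameters. Note that
broadcasting (see Section 2.1.4) is triggered during the summation. We use sigmoid
functions (as introduced in Section 5.1) to map the input values to the interval $(0, 1)$.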
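As a quick shape check of (10.1.1), the following sketch computes the three gates for one time step. It assumes NumPy in place of PyTorch, and the sizes, the random initialization and the helper names `sigmoid` and `gate_params` are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d, h = 2, 5, 3                       # batch size, inputs, hidden units (illustrative)
X_t = rng.normal(size=(n, d))           # input at time step t
H_prev = rng.normal(size=(n, h))        # hidden state of the previous time step

def gate_params():
    """Input-to-hidden weights, hidden-to-hidden weights and a bias for one gate."""
    return (rng.normal(scale=0.1, size=(d, h)),
            rng.normal(scale=0.1, size=(h, h)),
            np.zeros((1, h)))

W_xi, W_hi, b_i = gate_params()         # input gate
W_xf, W_hf, b_f = gate_params()         # forget gate
W_xo, W_ho, b_o = gate_params()         # output gate

# Each bias has shape (1, h) and is broadcast across the n rows of the batch.
I_t = sigmoid(X_t @ W_xi + H_prev @ W_hi + b_i)
F_t = sigmoid(X_t @ W_xf + H_prev @ W_hf + b_f)
O_t = sigmoid(X_t @ W_xo + H_prev @ W_ho + b_o)
print(I_t.shape, F_t.shape, O_t.shape)
```

All three gates have shape $n \times h$, and every entry lies strictly between 0 and 1 because of the sigmoid.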


Input Node 


Next we design the memory cell. Since we have not specified the action of the various gates
yet, we first introduce the input node $\tilde{\mathbf{C}}_t \in \mathbb{R}^{n \times h}$. Its computation is
similar to that of the three gates described above, but uses a tanh function with a value range
of $(-1, 1)$ as the activation function. This leads to the following equation at time step $t$:


$$\tilde{\mathbf{C}}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xc} + \mathbf{H}_{t-1} \mathbf{W}_{hc} + \mathbf{b}_c), \tag{10.1.2}$$


where $\mathbf{W}_{xc} \in \mathbb{R}^{d \times h}$ and $\mathbf{W}_{hc} \in \mathbb{R}^{h \times h}$ are weight parameters
and $\mathbf{b}_c \in \mathbb{R}^{1 \times h}$ is a bias parameter.


A quick illustration of the input node is shown in Fig. 10.1.2. 


Fig. 10.1.2: Computing the input node in an LSTM model.


Memory Cell Internal State 


In LSTMs, the input gate $\mathbf{I}_t$ governs how much we take new data into account via
$\tilde{\mathbf{C}}_t$ and the forget gate $\mathbf{F}_t$ addresses how much of the old cell internal state
$\mathbf{C}_{t-1} \in \mathbb{R}^{n \times h}$ we retain. Using the Hadamard (elementwise) product
operator $\odot$ we arrive at the following update equation:


$$\mathbf{C}_t = \mathbf{F}_t \odot \mathbf{C}_{t-1} + \mathbf{I}_t \odot \tilde{\mathbf{C}}_t. \tag{10.1.3}$$


If the forget gate is always 1 and the input gate is always 0, the memory cell internal state
$\mathbf{C}_{t-1}$ will remain constant forever, passing unchanged to each subsequent time step.
However, input gates and forget gates give the model the flexibility of being able to learn
when to keep this value unchanged and when to perturb it in response to subsequent inputs.
In practice, this design alleviates the vanishing gradient problem, resulting in models that
are much easier to train, especially when facing datasets with long sequence lengths.
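The two regimes of the update (10.1.3) can be sketched directly. The block below is an illustration under assumptions: the gates are fixed by hand rather than learned, the candidate values are random, and the number of steps and the forget value 0.9 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, steps = 2, 3, 100
C_start = rng.normal(size=(n, h))        # some initial cell internal state

# Case 1: forget gate fixed at 1, input gate fixed at 0.
F, I = np.ones((n, h)), np.zeros((n, h))
C = C_start.copy()
for _ in range(steps):
    C_tilde = np.tanh(rng.normal(size=(n, h)))   # arbitrary input node values
    C = F * C + I * C_tilde                      # update (10.1.3)
carried = C

# Case 2: forget gate fixed at 0.9, input gate still closed.
F = np.full((n, h), 0.9)
C = C_start.copy()
for _ in range(steps):
    C = F * C                                    # the input term is zero here
decayed = C

print(np.abs(carried - C_start).max(), np.abs(decayed).max())
```

In the first case the state is carried through all steps unchanged, whatever the candidates are. In the second case it shrinks by a factor of $0.9^{100}$, which is why a forget gate near 1 matters for retaining information over long spans.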


We thus arrive at the flow diagram in Fig. 10.1.3. 


Hidden State 


Last, we need to define how to compute the output of the memory cell, i.e., the hidden state
$\mathbf{H}_t \in \mathbb{R}^{n \times h}$, as seen by other layers. This is where the output gate comes into
play. In LSTMs, we first apply tanh to the memory cell internal state and then apply another
pointwise multiplication, this time with the output gate. This ensures that the values of
$\mathbf{H}_t$ are always in the interval $(-1, 1)$:


$$\mathbf{H}_t = \mathbf{O}_t \odot \tanh(\mathbf{C}_t). \tag{10.1.4}$$


Fig. 10.1.3: Computing the memory cell internal state in an LSTM model.


Whenever the output gate is close to 1, we allow the memory cell internal state to impact
the subsequent layers uninhibited, whereas for output gate values close to 0, we prevent the
current memory from impacting other layers of the network at the current time step. Note
that a memory cell can accrue information across many time steps without impacting the
rest of the network (as long as the output gate takes values close to 0), and then suddenly
impact the network at a subsequent time step as soon as the output gate flips from values
close to 0 to values close to 1. Fig. 10.1.4 has a graphical illustration of the data flow.
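The behavior described above can be reproduced with a small NumPy sketch of one LSTM step following (10.1.1) to (10.1.4). Everything here is an assumption made for illustration: the helper `lstm_step`, the parameter dictionary `p`, the sizes, and the trick of forcing the output gate shut or open through a large negative or positive bias.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(X, H, C, p):
    """One LSTM time step; returns the new hidden state and cell internal state."""
    I = sigmoid(X @ p["W_xi"] + H @ p["W_hi"] + p["b_i"])
    F = sigmoid(X @ p["W_xf"] + H @ p["W_hf"] + p["b_f"])
    O = sigmoid(X @ p["W_xo"] + H @ p["W_ho"] + p["b_o"])
    C_tilde = np.tanh(X @ p["W_xc"] + H @ p["W_hc"] + p["b_c"])
    C = F * C + I * C_tilde
    H = O * np.tanh(C)
    return H, C

rng = np.random.default_rng(0)
n, d, h, T = 2, 4, 3, 6
p = {}
for g in "ifoc":                                  # input, forget, output, input node
    p[f"W_x{g}"] = rng.normal(scale=0.1, size=(d, h))
    p[f"W_h{g}"] = rng.normal(scale=0.1, size=(h, h))
    p[f"b_{g}"] = np.zeros((1, h))

# Output gate forced close to 0: the cell accrues state, the hidden state stays near 0.
p["b_o"] = np.full((1, h), -20.0)
H, C = np.zeros((n, h)), np.zeros((n, h))
for _ in range(T):
    H, C = lstm_step(rng.normal(size=(n, d)), H, C, p)
hidden_closed = np.abs(H).max()
cell_magnitude = np.abs(C).max()

# Output gate flipped close to 1: the accumulated state now reaches the hidden state.
p["b_o"] = np.full((1, h), 20.0)
H_open, _ = lstm_step(rng.normal(size=(n, d)), H, C, p)
print(hidden_closed, cell_magnitude, np.abs(H_open).max())
```

While the gate is shut, the hidden state is numerically negligible although the cell state is not. One step after the flip, the hidden state takes on values of ordinary magnitude, still bounded by 1 in absolute value.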


Fig. 10.1.4: Computing the hidden state in an LSTM model.


10.1.2 Implementation from Scratch 


Now let’s implement an LSTM from scratch. As in the experiments in Section 9.5,
we first load The Time Machine dataset.


Initializing Model Parameters 


Next, we need to define and initialize the model parameters. As previously, the hyperpa- 
rameter num_hiddens dictates the number of hidden units. We initialize weights following 
a Gaussian distribution with 0.01 standard deviation, and we set the biases to 0. 


class LSTMScratch(d2l.Module):
    def __init__(self, num_inputs, num_hiddens, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()

        init_weight = lambda *shape: nn.Parameter(torch.randn(*shape) * sigma)
        triple = lambda: (init_weight(num_inputs, num_hiddens),
                          init_weight(num_hiddens, num_hiddens),
                          nn.Parameter(torch.zeros(num_hiddens)))
        self.W_xi, self.W_hi, self.b_i = triple()  # Input gate
        self.W_xf, self.W_hf, self.b_f = triple()  # Forget gate
        self.W_xo, self.W_ho, self.b_o = triple()  # Output gate
        self.W_xc, self.W_hc, self.b_c = triple()  # Input node


The actual model is defined as described above, consisting of three gates and an input node. 
Note that only the hidden state is passed to the output layer. 


@d2l.add_to_class(LSTMScratch)
def forward(self, inputs, H_C=None):
    if H_C is None:
        # Initial state with shape: (batch_size, num_hiddens)
        H = torch.zeros((inputs.shape[1], self.num_hiddens),
                        device=inputs.device)
        C = torch.zeros((inputs.shape[1], self.num_hiddens),
                        device=inputs.device)
    else:
        H, C = H_C
    outputs = []
    for X in inputs:
        I = torch.sigmoid(torch.matmul(X, self.W_xi) +
                          torch.matmul(H, self.W_hi) + self.b_i)
        F = torch.sigmoid(torch.matmul(X, self.W_xf) +
                          torch.matmul(H, self.W_hf) + self.b_f)
        O = torch.sigmoid(torch.matmul(X, self.W_xo) +
                          torch.matmul(H, self.W_ho) + self.b_o)
        C_tilde = torch.tanh(torch.matmul(X, self.W_xc) +
                             torch.matmul(H, self.W_hc) + self.b_c)
        C = F * C + I * C_tilde
        H = O * torch.tanh(C)
        outputs.append(H)
    return outputs, (H, C)


Training and Prediction 


Let’s train an LSTM model by instantiating the RNNLMScratch class from Section 9.5. 


data = d2l.TimeMachine(batch_size=1024, num_steps=32)
lstm = LSTMScratch(num_inputs=len(data.vocab), num_hiddens=32)
model = d2l.RNNLMScratch(lstm, vocab_size=len(data.vocab), lr=4)
trainer = d2l.Trainer(max_epochs=50, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)


[Plot: training and validation perplexity (train_ppl, val_ppl).]


10.1.3 Concise Implementation

Using high-level APIs, we can directly instantiate an LSTM model. This encapsulates 
all the configuration details that we made explicit above. The code is significantly faster 
as it uses compiled operators rather than Python for many details that we spelled out be- 
fore. 


class LSTM(d2l.RNN):
    def __init__(self, num_inputs, num_hiddens):
        d2l.Module.__init__(self)
        self.save_hyperparameters()
        self.rnn = nn.LSTM(num_inputs, num_hiddens)

    def forward(self, inputs, H_C=None):
        return self.rnn(inputs, H_C)


lstm = LSTM(num_inputs=len(data.vocab), num_hiddens=32)
model = d2l.RNNLM(lstm, vocab_size=len(data.vocab), lr=4)
trainer.fit(model, data)


[Plot: training and validation perplexity (train_ppl, val_ppl).]


model.predict('it has', 20, data.vocab, d2l.try_gpu())


'it has a the time travelly'


LSTMs are the prototypical latent variable autoregressive model with nontrivial state
control. Many variants thereof have been proposed over the years, e.g., multiple layers,
residual connections, different types of regularization. However, training LSTMs and other
sequence models (such as GRUs) is quite costly because of the long range dependency of
the sequence. Later we will encounter alternative models such as Transformers that can be
used in some cases.


10.1.4 Summary 


While LSTMs were published in 1997, they rose to great prominence with some victories in 
prediction competitions in the mid-2000s, and became the dominant models for sequence 
learning from 2011 until the rise of Transformer models, starting in 2017. Even
Transformers owe some of their key ideas to architecture design innovations introduced by
the LSTM.


LSTMs have three types of gates: input gates, forget gates, and output gates that control the 
flow of information. The hidden layer output of LSTM includes the hidden state and the 
memory cell internal state. Only the hidden state is passed into the output layer while the 
memory cell internal state remains entirely internal. LSTMs can alleviate vanishing and 
exploding gradients. 


10.1.5 Exercises 


1. Adjust the hyperparameters and analyze their influence on running time, perplexity, and 
the output sequence. 


2. How would you need to change the model to generate proper words rather than just 
sequences of characters? 


3. Compare the computational cost for GRUs, LSTMs, and regular RNNs for a given hid- 
den dimension. Pay special attention to the training and inference cost. 


4. Since the candidate memory cell ensures that the value range is between —1 and 1 by 
using the tanh function, why does the hidden state need to use the tanh function again 
to ensure that the output value range is between —1 and 1? 


5. Implement an LSTM model for time series prediction rather than character sequence 
prediction. 




10.2 Gated Recurrent Units (GRU)


As RNNs and particularly the LSTM architecture (Section 10.1) rapidly gained popularity
during the 2010s, a number of researchers began to experiment with simplified
architectures in hopes of retaining the key idea of incorporating an internal state and
multiplicative gating mechanisms but with the aim of speeding up computation. The gated
recurrent unit (GRU) (Cho et al., 2014) offered a streamlined version of the LSTM memory
cell that often achieves comparable performance but with the advantage of being faster to
compute (Chung et al., 2014).
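One rough way to see why a GRU is cheaper than an LSTM is to count parameters. The sketch below assumes the parameterization used by the from-scratch implementations in this chapter, where each gate or node is one block consisting of an input-to-hidden weight, a hidden-to-hidden weight and a single bias: one block for a vanilla RNN, three for a GRU and four for an LSTM. The function name and the sizes are illustrative, and library implementations may count biases differently.

```python
def recurrent_params(num_inputs, num_hiddens, num_blocks):
    """Parameters in `num_blocks` blocks, each with W_x (d x h), W_h (h x h), b (h)."""
    per_block = num_inputs * num_hiddens + num_hiddens * num_hiddens + num_hiddens
    return num_blocks * per_block

d, h = 28, 32   # illustrative input size and hidden size
counts = {
    "RNN": recurrent_params(d, h, 1),    # hidden state only
    "GRU": recurrent_params(d, h, 3),    # reset gate, update gate, candidate state
    "LSTM": recurrent_params(d, h, 4),   # input, forget, output gates, input node
}
print(counts)   # {'RNN': 1952, 'GRU': 5856, 'LSTM': 7808}
```

Under these assumptions the GRU has three quarters of the LSTM's recurrent parameters and correspondingly fewer matrix multiplications per time step.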


import torch
from torch import nn
from d2l import torch as d2l


10.2.1 Reset Gate and Update Gate 


Here, the LSTM’s three gates are replaced by two: the reset gate and the update gate. As 
with LSTMs, these gates are given sigmoid activations, forcing their values to lie in the 
interval (0, 1). Intuitively, the reset gate controls how much of the previous state we might 
still want to remember. Likewise, an update gate would allow us to control how much of 
the new state is just a copy of the old one. Fig. 10.2.1 illustrates the inputs for both the reset 
and update gates in a GRU, given the input of the current time step and the hidden state 
of the previous time step. The outputs of the gates are given by two fully connected layers 
with a sigmoid activation function. 

Fig. 10.2.1: Computing the reset gate and the update gate in a GRU model.


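Mathematically, for a given time step $t$, suppose that the input is a minibatch
$\mathbf{X}_t \in \mathbb{R}^{n \times d}$ (number of examples $= n$; number of inputs $= d$) and the hidden
state of the previous time step is $\mathbf{H}_{t-1} \in \mathbb{R}^{n \times h}$ (number of hidden units
$= h$). Then the reset gate $\mathbf{R}_t \in \mathbb{R}^{n \times h}$ and update gate
$\mathbf{Z}_t \in \mathbb{R}^{n \times h}$ are computed as follows:


$$\begin{aligned}
\mathbf{R}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xr} + \mathbf{H}_{t-1} \mathbf{W}_{hr} + \mathbf{b}_r),\\
\mathbf{Z}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xz} + \mathbf{H}_{t-1} \mathbf{W}_{hz} + \mathbf{b}_z),
\end{aligned} \tag{10.2.1}$$


where $\mathbf{W}_{xr}, \mathbf{W}_{xz} \in \mathbb{R}^{d \times h}$ and $\mathbf{W}_{hr}, \mathbf{W}_{hz} \in \mathbb{R}^{h \times h}$
are weight parameters and $\mathbf{b}_r, \mathbf{b}_z \in \mathbb{R}^{1 \times h}$ are bias parameters.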


10.2.2 Candidate Hidden State


Next, we integrate the reset gate $\mathbf{R}_t$ with the regular updating mechanism in (9.4.5),
leading to the following candidate hidden state $\tilde{\mathbf{H}}_t \in \mathbb{R}^{n \times h}$ at time step $t$:


$$\tilde{\mathbf{H}}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xh} + (\mathbf{R}_t \odot \mathbf{H}_{t-1}) \mathbf{W}_{hh} + \mathbf{b}_h), \tag{10.2.2}$$


where $\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}$ and $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$ are weight parameters,
$\mathbf{b}_h \in \mathbb{R}^{1 \times h}$ is the bias, and the symbol $\odot$ is the Hadamard (elementwise)
product operator. Here we use a tanh activation function.


The result is a candidate, since we still need to incorporate the action of the update gate.
Comparing with (9.4.5), the influence of the previous states can now be reduced with the
elementwise multiplication of $\mathbf{R}_t$ and $\mathbf{H}_{t-1}$ in (10.2.2). Whenever the entries in
the reset gate $\mathbf{R}_t$ are close to 1, we recover a vanilla RNN such as that in (9.4.5). For all
entries of the reset gate $\mathbf{R}_t$ that are close to 0, the candidate hidden state is the result of
an MLP with $\mathbf{X}_t$ as input. Any pre-existing hidden state is thus reset to defaults.
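The two limiting cases of the reset gate can be checked numerically. This is a minimal sketch with assumed sizes and random parameters, using NumPy instead of PyTorch; the helper `candidate` and the hand-set gate values (exactly 1 and exactly 0) are illustrative, since a sigmoid only approaches these limits.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 2, 4, 3
X_t = rng.normal(size=(n, d))
H_prev = rng.normal(size=(n, h))
W_xh = rng.normal(scale=0.1, size=(d, h))
W_hh = rng.normal(scale=0.1, size=(h, h))
b_h = np.zeros((1, h))

def candidate(R):
    """Candidate hidden state of (10.2.2) for a given reset gate R."""
    return np.tanh(X_t @ W_xh + (R * H_prev) @ W_hh + b_h)

vanilla_rnn = np.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)   # plain RNN update
mlp_on_input = np.tanh(X_t @ W_xh + b_h)                  # ignores H_prev entirely

print(np.allclose(candidate(np.ones((n, h))), vanilla_rnn))
print(np.allclose(candidate(np.zeros((n, h))), mlp_on_input))
```

With the gate at 1 the candidate coincides with the vanilla RNN update, and with the gate at 0 it depends on the current input alone.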


Fig. 10.2.2 illustrates the computational flow after applying the reset gate.


Fig. 10.2.2: Computing the candidate hidden state in a GRU model.


10.2.3 Hidden State 


Finally, we need to incorporate the effect of the update gate $\mathbf{Z}_t$. This determines the
extent to which the new hidden state $\mathbf{H}_t \in \mathbb{R}^{n \times h}$ matches the old state
$\mathbf{H}_{t-1}$ compared with how much it resembles the new candidate state $\tilde{\mathbf{H}}_t$. The
update gate $\mathbf{Z}_t$ can be used for this purpose, simply by taking elementwise convex
combinations of $\mathbf{H}_{t-1}$ and $\tilde{\mathbf{H}}_t$. This leads to the final update equation for the
GRU:


$$\mathbf{H}_t = \mathbf{Z}_t \odot \mathbf{H}_{t-1} + (1 - \mathbf{Z}_t) \odot \tilde{\mathbf{H}}_t. \tag{10.2.3}$$


Whenever the update gate $\mathbf{Z}_t$ is close to 1, we simply retain the old state. In this case
the information from $\mathbf{X}_t$ is ignored, effectively skipping time step $t$ in the dependency
chain. By contrast, whenever $\mathbf{Z}_t$ is close to 0, the new latent state $\mathbf{H}_t$ approaches the
candidate latent state $\tilde{\mathbf{H}}_t$. Fig. 10.2.3 shows the computational flow after the update
gate is in action.
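Putting (10.2.1) to (10.2.3) together, a single GRU step can be sketched as follows. The helper `gru_step`, the parameter dictionary `p`, the sizes and the use of a large positive or negative update-gate bias to push $\mathbf{Z}_t$ toward 1 or 0 are all assumptions made for this illustration, and NumPy stands in for PyTorch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(X, H, p):
    """One GRU time step; returns the new hidden state and the candidate state."""
    Z = sigmoid(X @ p["W_xz"] + H @ p["W_hz"] + p["b_z"])           # update gate
    R = sigmoid(X @ p["W_xr"] + H @ p["W_hr"] + p["b_r"])           # reset gate
    H_tilde = np.tanh(X @ p["W_xh"] + (R * H) @ p["W_hh"] + p["b_h"])
    return Z * H + (1 - Z) * H_tilde, H_tilde

rng = np.random.default_rng(0)
n, d, h = 2, 4, 3
p = {}
for g in "zrh":                                   # update, reset, candidate
    p[f"W_x{g}"] = rng.normal(scale=0.1, size=(d, h))
    p[f"W_h{g}"] = rng.normal(scale=0.1, size=(h, h))
    p[f"b_{g}"] = np.zeros((1, h))
X_t = rng.normal(size=(n, d))
H_prev = rng.normal(size=(n, h))

p["b_z"] = np.full((1, h), 30.0)                  # update gate close to 1
H_keep, _ = gru_step(X_t, H_prev, p)

p["b_z"] = np.full((1, h), -30.0)                 # update gate close to 0
H_new, H_cand = gru_step(X_t, H_prev, p)

print(np.abs(H_keep - H_prev).max(), np.abs(H_new - H_cand).max())
```

Both printed differences are numerically negligible: with the gate near 1 the old state is retained and the input is ignored, and with the gate near 0 the new state is the candidate.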


Fig. 10.2.3: Computing the hidden state in a GRU model.


In summary, GRUs have the following two distinguishing features:


• Reset gates help capture short-term dependencies in sequences.


• Update gates help capture long-term dependencies in sequences.


10.2.4 Implementation from Scratch 


To gain a better understanding of the GRU model, let’s implement it from scratch. 


Initializing Model Parameters 


The first step is to initialize the model parameters. We draw the weights from a Gaussian
distribution with standard deviation sigma and set the biases to 0. The hyperparameter
num_hiddens defines the number of hidden units. We instantiate all weights and biases
relating to the update gate, the reset gate, and the candidate hidden state.


class GRUScratch(d2l.Module):
    def __init__(self, num_inputs, num_hiddens, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()

        init_weight = lambda *shape: nn.Parameter(torch.randn(*shape) * sigma)
        triple = lambda: (init_weight(num_inputs, num_hiddens),
                          init_weight(num_hiddens, num_hiddens),
                          nn.Parameter(torch.zeros(num_hiddens)))
        self.W_xz, self.W_hz, self.b_z = triple()  # Update gate
        self.W_xr, self.W_hr, self.b_r = triple()  # Reset gate
        self.W_xh, self.W_hh, self.b_h = triple()  # Candidate hidden state


Defining the Model 


Now we are ready to define the GRU forward computation. Its structure is the same as that 
of the basic RNN cell, except that the update equations are more complex. 


@d2l.add_to_class(GRUScratch)
def forward(self, inputs, H=None):
    if H is None:
        # Initial state with shape: (batch_size, num_hiddens)
        H = torch.zeros((inputs.shape[1], self.num_hiddens),
                        device=inputs.device)
    outputs = []
    for X in inputs:
        Z = torch.sigmoid(torch.matmul(X, self.W_xz) +
                          torch.matmul(H, self.W_hz) + self.b_z)
        R = torch.sigmoid(torch.matmul(X, self.W_xr) +
                          torch.matmul(H, self.W_hr) + self.b_r)
        H_tilde = torch.tanh(torch.matmul(X, self.W_xh) +
                             torch.matmul(R * H, self.W_hh) + self.b_h)
        H = Z * H + (1 - Z) * H_tilde
        outputs.append(H)
    return outputs, H


Training 


Training a language model on The Time Machine dataset works in exactly the same manner 
as in Section 9.5. 


data = d2l.TimeMachine(batch_size=1024, num_steps=32)
gru = GRUScratch(num_inputs=len(data.vocab), num_hiddens=32)
model = d2l.RNNLMScratch(gru, vocab_size=len(data.vocab), lr=4)
trainer = d2l.Trainer(max_epochs=50, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)


[Plot: training and validation perplexity (train_ppl, val_ppl).]


10.2.5 Concise Implementation 


Using high-level APIs, we can directly instantiate a GRU model. This encapsulates all the
configuration details that we made explicit above.


class GRU(d2l.RNN):
    def __init__(self, num_inputs, num_hiddens):
        d2l.Module.__init__(self)
        self.save_hyperparameters()
        self.rnn = nn.GRU(num_inputs, num_hiddens)


The code is significantly faster in training as it uses compiled operators rather than Python.


gru = GRU(num_inputs=len(data.vocab), num_hiddens=32)
model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=4)
trainer.fit(model, data)


[Plot: training and validation perplexity (train_ppl, val_ppl).]


After training, we print out the perplexity on the training set and the predicted sequence 
following the provided prefix. 


model.predict('it has', 20, data.vocab, d2l.try_gpu())


'it has so it and the time '


10.2.6 Summary 


Compared with LSTMs, GRUs achieve similar performance but tend to be lighter
computationally. Generally, compared with simple RNNs, gated RNNs, such as LSTMs and
GRUs, can better capture dependencies for sequences with large time step distances. GRUs
contain basic RNNs as their extreme case whenever the reset gate is switched on. They can
also skip subsequences by turning on the update gate.


10.2.7 Exercises 


1. Assume that we only want to use the input at time step $t'$ to predict the output at time
step $t > t'$. What are the best values for the reset and update gates for each time step?


2. Adjust the hyperparameters and analyze their influence on running time, perplexity, and 
the output sequence. 


3. Compare runtime, perplexity, and the output strings for rnn.RNN and rnn.GRU imple- 
mentations with each other. 


4. What happens if you implement only parts of a GRU, e.g., with only a reset gate or only
an update gate?


10.3 Deep Recurrent Neural Networks


Up until now, we have focused on defining networks consisting of a sequence input, a single 
hidden RNN layer, and an output layer. Despite having just one hidden layer between the 
input at any time step and the corresponding output, there is a sense in which these networks 
are deep. Inputs from the first time step can influence the outputs at the final time step 
T (often 100s or 1000s of steps later). These inputs pass through T applications of the 
recurrent layer before reaching the final output. However, we often also wish to retain the 
ability to express complex relationships between the inputs at a given time step and the 
outputs at that same time step. Thus we often construct RNNs that are deep not only in the 
time direction but also in the input-to-output direction. This is precisely the notion of depth 
that we have already encountered in our development of MLPs and deep CNNs. 


The standard method for building this sort of deep RNN is strikingly simple: we stack 
the RNNs on top of each other. Given a sequence of length T, the first RNN produces a 
sequence of outputs, also of length T. These, in turn, constitute the inputs to the next RNN 
layer. In this short section, we illustrate this design pattern and present a simple example for 
how to code up such stacked RNNs. Below, in Fig. 10.3.1, we illustrate a deep RNN with L 
hidden layers. Each hidden state operates on a sequential input and produces a sequential 
output. Moreover, any RNN cell (white box in Fig. 10.3.1) at each time step depends on 
both the same layer’s value at the previous time step and the previous layer’s value at the 
same time step. 


Fig. 10.3.1: Architecture of a deep RNN.


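Formally, suppose that we have a minibatch input $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ (number of
examples $= n$; number of inputs in each example $= d$) at time step $t$. At the same time step,
let the hidden state of the $l^\textrm{th}$ hidden layer ($l = 1, \ldots, L$) be
$\mathbf{H}_t^{(l)} \in \mathbb{R}^{n \times h}$ (number of hidden units $= h$) and the output layer variable be
$\mathbf{O}_t \in \mathbb{R}^{n \times q}$ (number of outputs: $q$). Setting $\mathbf{H}_t^{(0)} = \mathbf{X}_t$,
the hidden state of the $l^\textrm{th}$ hidden layer that uses the activation function $\phi_l$ is
calculated as follows:


$$\mathbf{H}_t^{(l)} = \phi_l(\mathbf{H}_t^{(l-1)} \mathbf{W}_{xh}^{(l)} + \mathbf{H}_{t-1}^{(l)} \mathbf{W}_{hh}^{(l)} + \mathbf{b}_h^{(l)}), \tag{10.3.1}$$


where the weights $\mathbf{W}_{xh}^{(l)} \in \mathbb{R}^{h \times h}$ and $\mathbf{W}_{hh}^{(l)} \in \mathbb{R}^{h \times h}$, together
with the bias $\mathbf{b}_h^{(l)} \in \mathbb{R}^{1 \times h}$, are the model parameters of the $l^\textrm{th}$ hidden
layer (for the first layer, whose input is $\mathbf{X}_t$, the weight $\mathbf{W}_{xh}^{(1)}$ has shape
$d \times h$).


At the end, the calculation of the output layer is only based on the hidden state of the final
$L^\textrm{th}$ hidden layer:


$$\mathbf{O}_t = \mathbf{H}_t^{(L)} \mathbf{W}_{hq} + \mathbf{b}_q, \tag{10.3.2}$$


where the weight $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ and the bias $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ are the
model parameters of the output layer.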


Just as with MLPs, the number of hidden layers $L$ and the number of hidden units $h$ are
hyperparameters that we can tune. Common RNN layer widths ($h$) are in the range (64, 2056),
and common depths ($L$) are in the range (1, 8). In addition, we can easily get a deep-gated
RNN by replacing the hidden state computation in (10.3.1) with that from an LSTM or a
GRU.
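The recurrence (10.3.1) and the output layer (10.3.2) can be traced with a short NumPy sketch. It is an illustration under assumptions: tanh is used as the activation in every layer, all sizes are arbitrary, the first layer's input-to-hidden weight is taken to be $d \times h$ so that it can consume $\mathbf{X}_t$, and the list names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, q, L, T = 2, 5, 4, 3, 3, 6    # batch, inputs, hidden, outputs, layers, steps

# Layer 1 maps d -> h; deeper layers map h -> h.
W_xh = [rng.normal(scale=0.1, size=(d if l == 0 else h, h)) for l in range(L)]
W_hh = [rng.normal(scale=0.1, size=(h, h)) for _ in range(L)]
b_h = [np.zeros((1, h)) for _ in range(L)]
W_hq = rng.normal(scale=0.1, size=(h, q))
b_q = np.zeros((1, q))

X = rng.normal(size=(T, n, d))
H = [np.zeros((n, h)) for _ in range(L)]    # H[l] holds the previous state of layer l+1
outputs = []
for t in range(T):
    below = X[t]                            # H_t^{(0)} = X_t
    for l in range(L):
        # Same layer at the previous time step, previous layer at the same time step.
        H[l] = np.tanh(below @ W_xh[l] + H[l] @ W_hh[l] + b_h[l])
        below = H[l]
    outputs.append(below @ W_hq + b_q)      # output uses only the top layer
outputs = np.stack(outputs)
print(outputs.shape)                        # (T, n, q)
```

Each layer keeps its own hidden state across time, and only the top layer's state feeds the output layer at every step.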


import torch
from torch import nn
from d2l import torch as d2l


10.3.1 Implementation from Scratch 


To implement a multilayer RNN from scratch, we can treat each layer as an RNNScratch 
instance with its own learnable parameters. 


class StackedRNNScratch(d2l.Module):
    def __init__(self, num_inputs, num_hiddens, num_layers, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.rnns = nn.Sequential(*[d2l.RNNScratch(
            num_inputs if i==0 else num_hiddens, num_hiddens, sigma)
            for i in range(num_layers)])


The multilayer forward computation simply performs forward computation layer by layer. 


@d2l.add_to_class(StackedRNNScratch)
def forward(self, inputs, Hs=None):
    outputs = inputs
    if Hs is None: Hs = [None] * self.num_layers
    for i in range(self.num_layers):
        outputs, Hs[i] = self.rnns[i](outputs, Hs[i])
        outputs = torch.stack(outputs, 0)
    return outputs, Hs


As an example, we train a deep GRU model on The Time Machine dataset (same as in
Section 9.5). To keep things simple we set the number of layers to 2.


data = d2l.TimeMachine(batch_size=1024, num_steps=32)
rnn_block = StackedRNNScratch(num_inputs=len(data.vocab),
                              num_hiddens=32, num_layers=2)
model = d2l.RNNLMScratch(rnn_block, vocab_size=len(data.vocab), lr=2)
trainer = d2l.Trainer(max_epochs=100, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)

[Plot: training and validation perplexity (train_ppl, val_ppl).]


10.3.2 Concise Implementation 


Fortunately many of the logistical details required to implement multiple layers of an RNN 
are readily available in high-level APIs. Our concise implementation will use such built- 
in functionalities. The code generalizes the one we used previously in Section 10.2, let- 
ting us specify the number of layers explicitly rather than picking the default of only one 
layer. 


class GRU(d2l.RNN):  #@save
    """The multilayer GRU model."""
    def __init__(self, num_inputs, num_hiddens, num_layers, dropout=0):
        d2l.Module.__init__(self)
        self.save_hyperparameters()
        self.rnn = nn.GRU(num_inputs, num_hiddens, num_layers,
                          dropout=dropout)


The architectural decisions such as choosing hyperparameters are very similar to those of 
Section 10.2. We pick the same number of inputs and outputs as we have distinct tokens, 
i.e., vocab_size. The number of hidden units is still 32. The only difference is that we now 
select a nontrivial number of hidden layers by specifying the value of num_layers. 


gru = GRU(num_inputs=len(data.vocab), num_hiddens=32, num_layers=2)
model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=2)
trainer.fit(model, data)


[Plot: training and validation perplexity (train_ppl, val_ppl).]


model.predict('it has', 20, data.vocab, d2l.try_gpu())


'it has for and the time th'


10.3.3 Summary 


In deep RNNs, the hidden state information is passed to the next time step of the current layer and the current time step of the next layer. There exist many different flavors of deep RNNs, such as LSTMs, GRUs, or vanilla RNNs. Conveniently, these models are all available as parts of the high-level APIs of deep learning frameworks. Initialization of models requires care. Overall, deep RNNs require a considerable amount of work (such as tuning the learning rate and gradient clipping) to ensure proper convergence.
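
The flow of hidden states through the layers can be checked on a small example. The following is a minimal sketch that assumes only PyTorch's built-in nn.GRU; the tensor sizes are arbitrary illustrative choices. It shows that the returned outputs hold the top layer's hidden states at every time step, while the returned state holds the final-step hidden state of every layer.

```python
import torch
from torch import nn

# Illustrative sizes (assumptions, not values from the text)
num_steps, batch_size, num_inputs, num_hiddens, num_layers = 5, 2, 3, 4, 2

rnn = nn.GRU(num_inputs, num_hiddens, num_layers)
X = torch.randn(num_steps, batch_size, num_inputs)
outputs, state = rnn(X)

# outputs: hidden states of the last layer at all time steps
print(outputs.shape)  # (num_steps, batch_size, num_hiddens)
# state: hidden states of all layers at the final time step
print(state.shape)    # (num_layers, batch_size, num_hiddens)

# The top layer's final-step hidden state appears in both tensors
top_matches = torch.allclose(outputs[-1], state[-1])
```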


10.3.4 Exercises 
1. Replace the GRU by an LSTM and compare the accuracy and training speed. 


2. Increase the training data to include multiple books. How low can you go on the per- 
plexity scale? 


3. Would you want to combine sources of different authors when modeling text? Why is 
this a good idea? What could go wrong? 


Discussions 147. 


10.4 Bidirectional Recurrent Neural Networks 


So far, our working example of a sequence learning task has been language modeling, where we aim to predict the next token given all previous tokens in a sequence. In this scenario, we wish only to condition upon the leftward context, and thus the unidirectional chaining of a standard RNN seems appropriate. However, there are many other sequence learning tasks where it is perfectly fine to condition the prediction at every time step on both the leftward and the rightward context. Consider, for example, part-of-speech detection. Why shouldn’t we take the context in both directions into account when assessing the part of speech associated with a given word?


Another common task (often useful as a pretraining exercise prior to fine-tuning a model on an actual task of interest) is to mask out random tokens in a text document and then to train a sequence model to predict the values of the missing tokens. Note that depending on what comes after the blank, the likely value of the missing token changes dramatically:


• I am ___.

• I am ___ hungry.

• I am ___ hungry, and I can eat half a pig.


In the first sentence “happy” seems to be a likely candidate. The words “not” and “very” seem plausible in the second sentence, but “not” seems incompatible with the third sentence.


Fortunately, a simple technique transforms any unidirectional RNN into a bidirectional RNN (Schuster and Paliwal, 1997). We simply implement two unidirectional RNN layers chained together in opposite directions and acting on the same input (Fig. 10.4.1). For the first RNN layer, the first input is $\mathbf{x}_1$ and the last input is $\mathbf{x}_T$, but for the second RNN layer, the first input is $\mathbf{x}_T$ and the last input is $\mathbf{x}_1$. To produce the output of this bidirectional RNN layer, we simply concatenate together the corresponding outputs of the two underlying unidirectional RNN layers.


Fig. 10.4.1: Architecture of a bidirectional RNN.


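Formally for any time step $t$, we consider a minibatch input $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ (number of examples $= n$; number of inputs in each example $= d$) and let the hidden layer activation function be $\phi$. In the bidirectional architecture, the forward and backward hidden states for this time step are $\overrightarrow{\mathbf{H}}_t \in \mathbb{R}^{n \times h}$ and $\overleftarrow{\mathbf{H}}_t \in \mathbb{R}^{n \times h}$, respectively, where $h$ is the number of hidden units. The forward and backward hidden state updates are as follows:

$$\begin{aligned}
\overrightarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{xh}^{(f)} + \overrightarrow{\mathbf{H}}_{t-1} \mathbf{W}_{hh}^{(f)} + \mathbf{b}_h^{(f)}),\\
\overleftarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{xh}^{(b)} + \overleftarrow{\mathbf{H}}_{t+1} \mathbf{W}_{hh}^{(b)} + \mathbf{b}_h^{(b)}),
\end{aligned} \tag{10.4.1}$$

where the weights $\mathbf{W}_{xh}^{(f)} \in \mathbb{R}^{d \times h}$, $\mathbf{W}_{hh}^{(f)} \in \mathbb{R}^{h \times h}$, $\mathbf{W}_{xh}^{(b)} \in \mathbb{R}^{d \times h}$, and $\mathbf{W}_{hh}^{(b)} \in \mathbb{R}^{h \times h}$, and the biases $\mathbf{b}_h^{(f)} \in \mathbb{R}^{1 \times h}$ and $\mathbf{b}_h^{(b)} \in \mathbb{R}^{1 \times h}$ are all the model parameters.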


Next, we concatenate the forward and backward hidden states $\overrightarrow{\mathbf{H}}_t$ and $\overleftarrow{\mathbf{H}}_t$ to obtain the hidden state $\mathbf{H}_t \in \mathbb{R}^{n \times 2h}$ for feeding into the output layer. In deep bidirectional RNNs with multiple hidden layers, such information is passed on as input to the next bidirectional layer. Last, the output layer computes the output $\mathbf{O}_t \in \mathbb{R}^{n \times q}$ (number of outputs $= q$):

$$\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q. \tag{10.4.2}$$

Here, the weight matrix $\mathbf{W}_{hq} \in \mathbb{R}^{2h \times q}$ and the bias $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ are the model parameters of the output layer. While technically the two directions can have different numbers of hidden units, this design choice is seldom made in practice. We now demonstrate a simple implementation of a bidirectional RNN.
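
Before turning to the class-based implementation, the computation just described can be traced step by step. The following is a minimal sketch in plain PyTorch, not the book's implementation: the sizes, the tanh activation standing in for the activation function, the zero initial states, and all variable names are illustrative assumptions. It runs one pass from left to right, one from right to left, concatenates the two hidden states at each time step into a tensor of shape (n, 2h), and applies a linear output layer mapping 2h features to q outputs.

```python
import torch

torch.manual_seed(0)
n, d, h, q, T = 2, 3, 4, 6, 5      # batch, inputs, hidden units, outputs, steps
X = torch.randn(T, n, d)           # X[t] plays the role of the minibatch at step t

def init_params():
    # One set of (W_xh, W_hh, b_h) per direction, with small random weights
    return torch.randn(d, h) * 0.1, torch.randn(h, h) * 0.1, torch.zeros(1, h)

W_xh_f, W_hh_f, b_h_f = init_params()   # forward direction
W_xh_b, W_hh_b, b_h_b = init_params()   # backward direction

# Forward direction: the state at step t depends on the state at step t-1
H_f, fwd = torch.zeros(n, h), []
for t in range(T):
    H_f = torch.tanh(X[t] @ W_xh_f + H_f @ W_hh_f + b_h_f)
    fwd.append(H_f)

# Backward direction: the state at step t depends on the state at step t+1
H_b, bwd = torch.zeros(n, h), [None] * T
for t in reversed(range(T)):
    H_b = torch.tanh(X[t] @ W_xh_b + H_b @ W_hh_b + b_h_b)
    bwd[t] = H_b

# Concatenate both directions per time step: shape (n, 2h)
H = [torch.cat((f, b), dim=-1) for f, b in zip(fwd, bwd)]

# Output layer: shape (n, q) per time step
W_hq, b_q = torch.randn(2 * h, q) * 0.1, torch.zeros(1, q)
O = [H_t @ W_hq + b_q for H_t in H]
print(H[0].shape, O[0].shape)
```

Note that the backward states can only be computed once the whole sequence is available, which is why this construction suits encoding rather than next-token prediction.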


import torch
from torch import nn
from d2l import torch as d2l


10.4.1 Implementation from Scratch 


To implement a bidirectional RNN from scratch, we can include two unidirectional RNNScratch 
instances with separate learnable parameters. 


class BiRNNScratch(d2l.Module):
    def __init__(self, num_inputs, num_hiddens, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.f_rnn = d2l.RNNScratch(num_inputs, num_hiddens, sigma)
        self.b_rnn = d2l.RNNScratch(num_inputs, num_hiddens, sigma)
        self.num_hiddens *= 2  # The output dimension will be doubled


States of forward and backward RNNs are updated separately, while outputs of these two 
RNNs are concatenated. 


@d2l.add_to_class(BiRNNScratch)
def forward(self, inputs, Hs=None):
    f_H, b_H = Hs if Hs is not None else (None, None)
    f_outputs, f_H = self.f_rnn(inputs, f_H)
    b_outputs, b_H = self.b_rnn(reversed(inputs), b_H)
    # Re-reverse the backward outputs so that both lists are aligned in time
    outputs = [torch.cat((f, b), -1) for f, b in zip(
        f_outputs, reversed(b_outputs))]
    return outputs, (f_H, b_H)


10.4.2 Concise Implementation 


Using the high-level APIs, we can implement bidirectional RNNs more concisely. Here we 
take a GRU model as an example. 
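
Before wrapping the built-in layer in a class, its behavior can be probed directly. The following is a minimal sketch that assumes only PyTorch's nn.GRU with bidirectional=True; the sizes are arbitrary illustrative choices. It checks that the two directions are concatenated along the last axis of the outputs, and that the final state of the backward direction corresponds to the first time step.

```python
import torch
from torch import nn

# Illustrative sizes (assumptions, not values from the text)
num_steps, batch_size, num_inputs, num_hiddens = 5, 2, 3, 4

rnn = nn.GRU(num_inputs, num_hiddens, bidirectional=True)
X = torch.randn(num_steps, batch_size, num_inputs)
outputs, state = rnn(X)

# Both directions are concatenated along the feature axis
print(outputs.shape)  # (num_steps, batch_size, 2 * num_hiddens)
# One final hidden state per direction
print(state.shape)    # (2, batch_size, num_hiddens)

# The forward direction finishes at the last step, the backward one at the first
fwd_ok = torch.allclose(outputs[-1, :, :num_hiddens], state[0])
bwd_ok = torch.allclose(outputs[0, :, num_hiddens:], state[1])
```

This doubling of the feature dimension is the reason a wrapper class needs to double its recorded number of hidden units.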


class BiGRU(d2l.RNN):
    def __init__(self, num_inputs, num_hiddens):
        d2l.Module.__init__(self)
        self.save_hyperparameters()
        self.rnn = nn.GRU(num_inputs, num_hiddens, bidirectional=True)
        self.num_hiddens *= 2


10.4.3 Summary


In bidirectional RNNs, the hidden state for each time step is simultaneously determined 
by the data prior to and after the current time step. Bidirectional RNNs are mostly use- 
ful for sequence encoding and the estimation of observations given bidirectional context. 
Bidirectional RNNs are very costly to train due to long gradient chains. 


10.4.4 Exercises 


1. If the different directions use a different number of hidden units, how will the shape of
$\mathbf{H}_t$ change?


2. Design a bidirectional RNN with multiple hidden layers. 


3. Polysemy is common in natural languages. For example, the word “bank” has different 
meanings in contexts “i went to the bank to deposit cash” and “i went to the bank to sit 
down”. How can we design a neural network model such that given a context sequence 
and a word, a vector representation of the word in the correct context will be returned? 
What type of neural architecture is preferred for handling polysemy?


Discussions 148.


10.5 Machine Translation and the Dataset 


Among the major breakthroughs that prompted widespread interest in modern RNNs was a major advance in the applied field of statistical machine translation. Here, the model is presented with a sentence in one language and must predict the corresponding sentence in another. Note that here the sentences may be of different lengths, and that corresponding words in the two sentences may not occur in the same order, owing to differences in the two languages’ grammatical structure.


Many problems have this flavor of mapping between two such “unaligned” sequences. 
Examples include mapping from dialog prompts to replies or from questions to answers. 
Broadly, such problems are called sequence-to-sequence (seq2seq) problems and they are 
our focus for both the remainder of this chapter and much of Chapter 11. 


In this section, we introduce the machine translation problem and an example dataset that 
we will use in the subsequent examples. For decades, statistical formulations of translation 
between languages had been popular (Brown et al., 1990, Brown et al., 1988), even before 
researchers got neural network approaches working (methods were often lumped together 
under the term neural machine translation). 


First we will need some new code to process our data. Unlike the language modeling that we saw in Section 9.3, here each example consists of two separate text sequences, one in the source language and another (the translation) in the target language. The following code snippets will show how to load the preprocessed data into minibatches for training.

import os
import torch
from d2l import torch as d2l


10.5.1 Downloading and Preprocessing the Dataset 


To begin, we download an English-French dataset that consists of bilingual sentence pairs from the Tatoeba Project. Each line in the dataset is a tab-delimited pair consisting of an English text sequence (the source) and the translated French text sequence (the target). Note that each text sequence can be just one sentence, or a paragraph of multiple sentences.


class MTFraEng(d2l.DataModule):  #@save
    """The English-French dataset."""
    def _download(self):
        d2l.extract(d2l.download(
            d2l.DATA_URL+'fra-eng.zip', self.root,
            '94646ad1522d915e7b0f9296181140edcf86a4f5'))
        with open(self.root + '/fra-eng/fra.txt', encoding='utf-8') as f:
            return f.read()

data = MTFraEng()
raw_text = data._download()
print(raw_text[:75])


Downloading ../data/fra-eng.zip from http://d2l-data.s3-accelerate.amazonaws.com/fra-eng.zip...
Go.	Va !
Hi.	Salut !
Run!	Cours !
Run!	Courez !
Who?	Qui ?
Wow!	Ça alors !


After downloading the dataset, we proceed with several preprocessing steps for the raw text 
data. For instance, we replace non-breaking space with space, convert uppercase letters to 
lowercase ones, and insert space between words and punctuation marks. 


@d2l.add_to_class(MTFraEng)  #@save
def _preprocess(self, text):
    # Replace non-breaking space with space
    text = text.replace('\u202f', ' ').replace('\xa0', ' ')
    # Insert space between words and punctuation marks
    no_space = lambda char, prev_char: char in ',.!?' and prev_char != ' '
    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
           for i, char in enumerate(text.lower())]
    return ''.join(out)

text = data._preprocess(raw_text)
print(text[:80])


go .	va !
hi .	salut !
run !	cours !
run !	courez !
who ?	qui ?
wow !	ça alors !


10.5.2 Tokenization 


Unlike the character-level tokenization in Section 9.3, for machine translation we prefer 
word-level tokenization here (today’s state-of-the-art models use more complex tokeniza- 
tion techniques). The following _tokenize method tokenizes the first max_examples text 
sequence pairs, where each token is either a word or a punctuation mark. We append the 
special “<eos>” token to the end of every sequence to indicate the end of the sequence. 
When a model is predicting by generating a sequence token after token, the generation of 
the “<eos>” token can suggest that the output sequence is complete. In the end, the method 
below returns two lists of token lists: src and tgt. Specifically, src[i] is a list of tokens 
from the $i^\mathrm{th}$ text sequence in the source language (English here) and tgt[i] is that in the
target language (French here). 


@d2l.add_to_class(MTFraEng)  #@save
def _tokenize(self, text, max_examples=None):
    src, tgt = [], []
    for i, line in enumerate(text.split('\n')):
        if max_examples and i > max_examples: break
        parts = line.split('\t')
        if len(parts) == 2:
            # Skip empty tokens
            src.append([t for t in f'{parts[0]} <eos>'.split(' ') if t])
            tgt.append([t for t in f'{parts[1]} <eos>'.split(' ') if t])
    return src, tgt

src, tgt = data._tokenize(text)
src[:6], tgt[:6]


([['go', '.', '<eos>'],
  ['hi', '.', '<eos>'],
  ['run', '!', '<eos>'],
  ['run', '!', '<eos>'],
  ['who', '?', '<eos>'],
  ['wow', '!', '<eos>']],
 [['va', '!', '<eos>'],
  ['salut', '!', '<eos>'],
  ['cours', '!', '<eos>'],
  ['courez', '!', '<eos>'],
  ['qui', '?', '<eos>'],
  ['ça', 'alors', '!', '<eos>']])


Let’s plot the histogram of the number of tokens per text sequence. In this simple English— 
French dataset, most of the text sequences have fewer than 20 tokens. 


#@save
def show_list_len_pair_hist(legend, xlabel, ylabel, xlist, ylist):
    """Plot the histogram for list length pairs."""
    d2l.set_figsize()
    _, _, patches = d2l.plt.hist(
        [[len(l) for l in xlist], [len(l) for l in ylist]])
    d2l.plt.xlabel(xlabel)
    d2l.plt.ylabel(ylabel)
    for patch in patches[1].patches:
        patch.set_hatch('/')
    d2l.plt.legend(legend)

show_list_len_pair_hist(['source', 'target'], '# tokens per sequence',
                        'count', src, tgt);


[Figure: histogram of # tokens per sequence for source and target.]


10.5.3 Loading Sequences of Fixed Length 


Recall that in language modeling each example sequence, either a segment of one sentence 
or a span over multiple sentences, had a fixed length. This was specified by the num_steps 
(number of time steps or tokens) argument from Section 9.3. In machine translation, each 
example is a pair of source and target text sequences, where the two text sequences may 
have different lengths. 


For computational efficiency, we can still process a minibatch of text sequences at one time by truncation and padding. Suppose that every sequence in the same minibatch should have the same length num_steps. If a text sequence has fewer than num_steps tokens, we will keep appending the special “<pad>” token to its end until its length reaches num_steps. Otherwise, we will truncate the text sequence by only taking its first num_steps tokens and discarding the rest. In this way, every text sequence will have the same length to be loaded in minibatches of the same shape. Furthermore, we also record the length of the source sequence excluding padding tokens. This information will be needed by some models that we will cover later.
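
The truncation and padding rule can be illustrated in isolation. The following is a minimal sketch in plain Python; the helper names pad_or_trim and valid_len are chosen here for illustration, and the sketch operates on token lists rather than tensors.

```python
def pad_or_trim(seq, num_steps, pad_token='<pad>'):
    """Truncate seq to num_steps tokens, or right-pad it to that length."""
    if len(seq) > num_steps:
        return seq[:num_steps]
    return seq + [pad_token] * (num_steps - len(seq))

def valid_len(seq, pad_token='<pad>'):
    """Count the tokens that are not padding."""
    return sum(tok != pad_token for tok in seq)

short = ['hi', '.', '<eos>']
long_ = ['i', 'can', 'eat', 'half', 'a', 'pig', '.', '<eos>']

batch = [pad_or_trim(s, 5) for s in (short, long_)]
lengths = [valid_len(s) for s in batch]
print(batch)
print(lengths)
```

Note that truncation may cut off the “<eos>” token of a long sequence, as happens to the second example here.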


Since the machine translation dataset consists of pairs of languages, we build two separate vocabularies, one for the source language and one for the target language. With word-level tokenization, the vocabulary size will be significantly larger than that using character-level tokenization. To alleviate this, here we treat infrequent tokens that appear fewer than twice as the same unknown (“<unk>”) token. As we will explain later (Fig. 10.7.1), when training with target sequences, the decoder output (label tokens) can be the same as the decoder input (target tokens), shifted by one token; and the special beginning-of-sequence “<bos>” token will be used as the first input token for predicting the target sequence (Fig. 10.7.3).
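
The effect of the frequency threshold can be seen on a toy corpus. The following is a minimal sketch in plain Python and is not the d2l.Vocab implementation: the functions build_vocab and encode, the ordering of indices, and the set of reserved tokens are assumptions made for illustration.

```python
import collections

def build_vocab(sentences, min_freq=2,
                reserved=('<unk>', '<pad>', '<bos>', '<eos>')):
    """Map tokens seen at least min_freq times to indices (hypothetical helper)."""
    counter = collections.Counter(tok for s in sentences for tok in s)
    kept = sorted(tok for tok, c in counter.items()
                  if c >= min_freq and tok not in reserved)
    return {tok: i for i, tok in enumerate(list(reserved) + kept)}

def encode(sentence, vocab):
    """Replace every out-of-vocabulary token by the index of '<unk>'."""
    return [vocab.get(tok, vocab['<unk>']) for tok in sentence]

sentences = [['run', '!'], ['run', '.'], ['wow', '!']]
vocab = build_vocab(sentences)
ids = encode(['wow', '!'], vocab)
print(vocab)
print(ids)
```

Here “wow” and “.” each occur only once, so both fall below the threshold and are encoded as the unknown token.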


@d2l.add_to_class(MTFraEng)  #@save
def __init__(self, batch_size, num_steps=9, num_train=512, num_val=128):
    super(MTFraEng, self).__init__()
    self.save_hyperparameters()
    self.arrays, self.src_vocab, self.tgt_vocab = self._build_arrays(
        self._download())

@d2l.add_to_class(MTFraEng)  #@save
def _build_arrays(self, raw_text, src_vocab=None, tgt_vocab=None):
    def _build_array(sentences, vocab, is_tgt=False):
        pad_or_trim = lambda seq, t: (
            seq[:t] if len(seq) > t else seq + ['<pad>'] * (t - len(seq)))
        sentences = [pad_or_trim(s, self.num_steps) for s in sentences]
        if is_tgt:
            sentences = [['<bos>'] + s for s in sentences]
        if vocab is None:
            vocab = d2l.Vocab(sentences, min_freq=2)
        array = torch.tensor([vocab[s] for s in sentences])
        valid_len = (array != vocab['<pad>']).type(torch.int32).sum(1)
        return array, vocab, valid_len
    src, tgt = self._tokenize(self._preprocess(raw_text),
                              self.num_train + self.num_val)
    src_array, src_vocab, src_valid_len = _build_array(src, src_vocab)
    tgt_array, tgt_vocab, _ = _build_array(tgt, tgt_vocab, True)
    # Decoder input drops the last token; the label drops the leading <bos>
    return ((src_array, tgt_array[:,:-1], src_valid_len, tgt_array[:,1:]),
            src_vocab, tgt_vocab)


10.5.4 Reading the Dataset 


Finally, we define the get_dataloader method to return the data iterator. 


@d2l.add_to_class(MTFraEng)  #@save
def get_dataloader(self, train):
    idx = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader(self.arrays, train, idx)


Let’s read the first minibatch from the English-French dataset.

data = MTFraEng(batch_size=3)
src, tgt, src_valid_len, label = next(iter(data.train_dataloader()))
print('source:', src.type(torch.int32))
print('decoder input:', tgt.type(torch.int32))
print('source len excluding pad:', src_valid_len.type(torch.int32))
print('label:', label.type(torch.int32))


source: tensor([[117, 182,   0,   3,   4,   4,   4,   4,   4],
        [ 62,  72,   2,   3,   4,   4,   4,   4,   4],
        [ 57, 124,   0,   3,   4,   4,   4,   4,   4]], dtype=torch.int32)
decoder input: tensor([[  3,  37, 100,  58, 160,   0,   4,   5,   5],
        [  3,   6,   2,   4,   5,   5,   5,   5,   5],
        [  3, 180,   0,   4,   5,   5,   5,   5,   5]], dtype=torch.int32)
source len excluding pad: tensor([4, 4, 4], dtype=torch.int32)
label: tensor([[ 37, 100,  58, 160,   0,   4,   5,   5,   5],
        [  6,   2,   4,   5,   5,   5,   5,   5,   5],
        [180,   0,   4,   5,   5,   5,   5,   5,   5]], dtype=torch.int32)


We show a pair of source and target sequences processed by the above _build_arrays 
method (in the string format). 


@d2l.add_to_class(MTFraEng)  #@save
def build(self, src_sentences, tgt_sentences):
    raw_text = '\n'.join([src + '\t' + tgt for src, tgt in zip(
        src_sentences, tgt_sentences)])
    arrays, _, _ = self._build_arrays(
        raw_text, self.src_vocab, self.tgt_vocab)
    return arrays

src, tgt, _, _ = data.build(['hi .'], ['salut .'])
print('source:', data.src_vocab.to_tokens(src[0].type(torch.int32)))
print('target:', data.tgt_vocab.to_tokens(tgt[0].type(torch.int32)))


source: ['hi', '.', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
target: ['<bos>', 'salut', '.', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']


10.5.5 Summary 


In natural language processing, machine translation refers to the task of automatically map- 
ping from a sequence representing a string of text in a source language to a string represent- 
ing a plausible translation in a target language. Using word-level tokenization, the vocab- 
ulary size will be significantly larger than that using character-level tokenization, but the 
sequence lengths will be much shorter. To mitigate the large vocabulary size, we can treat 
infrequent tokens as some “unknown” token. We can truncate and pad text sequences so 
that all of them will have the same length to be loaded in minibatches. Modern implementations often bucket sequences with similar lengths to avoid wasting excessive computation on padding.
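
The idea of bucketing can be sketched as follows. This is a minimal illustration in plain Python under simplifying assumptions (sort once by length, then cut into consecutive batches); the function name bucket_batches is hypothetical, and practical implementations typically also shuffle within and across buckets.

```python
def bucket_batches(sequences, batch_size, pad_token='<pad>'):
    """Sort by length, then pad each batch only to its own longest sequence."""
    ordered = sorted(sequences, key=len)
    batches = []
    for i in range(0, len(ordered), batch_size):
        chunk = ordered[i:i + batch_size]
        width = max(len(s) for s in chunk)
        batches.append([s + [pad_token] * (width - len(s)) for s in chunk])
    return batches

# Four toy sequences with 2, 9, 3, and 8 tokens
seqs = [['a'] * n for n in (2, 9, 3, 8)]
batches = bucket_batches(seqs, batch_size=2)

# Padding needed with bucketing versus padding everything to the global maximum
num_pads = sum(row.count('<pad>') for b in batches for row in b)
num_pads_global = sum(max(map(len, seqs)) - len(s) for s in seqs)
print(num_pads, num_pads_global)
```

In this toy case the bucketed batches need far fewer padding tokens than padding every sequence to the length of the longest one.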


10.5.6 Exercises 


1. Try different values of the max_examples argument in the _tokenize method. How 
does this affect the vocabulary sizes of the source language and the target language? 


2. Text in some languages such as Chinese and Japanese does not have word boundary 
indicators (e.g., space). Is word-level tokenization still a good idea for such cases? Why 
or why not? 


Discussions 150.


10.6 The Encoder—Decoder Architecture 


In general sequence-to-sequence problems like machine translation (Section 10.5), inputs 
and outputs are of varying lengths that are unaligned. The standard approach to handling 
this sort of data is to design an encoder—decoder architecture (Fig. 10.6.1) consisting of 
two major components: an encoder that takes a variable-length sequence as input, and a 
decoder that acts as a conditional language model, taking in the encoded input and the 
leftwards context of the target sequence and predicting the subsequent token in the target 
sequence. 


Fig. 10.6.1: The encoder-decoder architecture.


Let’s take machine translation from English to French as an example. Given an input sequence in English: “They”, “are”, “watching”, “.”, this encoder-decoder architecture first encodes the variable-length input into a state, then decodes the state to generate the translated sequence, token by token, as output: “Ils”, “regardent”, “.”. Since the encoder-decoder architecture forms the basis of different sequence-to-sequence models in subsequent sections, this section will convert this architecture into an interface that will be implemented later.


from torch import nn
from d2l import torch as d2l


10.6.1 Encoder 


In the encoder interface, we just specify that the encoder takes variable-length sequences as input X. The implementation will be provided by any model that inherits this base Encoder class.

class Encoder(nn.Module):  #@save
    """The base encoder interface for the encoder--decoder architecture."""
    def __init__(self):
        super().__init__()

    # Later there can be additional arguments (e.g., length excluding padding)
    def forward(self, X, *args):
        raise NotImplementedError


10.6.2 Decoder 


In the following decoder interface, we add an additional init_state method to convert the encoder output (enc_all_outputs) into the encoded state. Note that this step may require extra inputs, such as the valid length of the input, which was explained in Section 10.5. To generate a variable-length sequence token by token, at each time step the decoder maps an input (e.g., the token generated at the previous time step) and the encoded state into an output token at the current time step.


class Decoder(nn.Module):  #@save
    """The base decoder interface for the encoder--decoder architecture."""
    def __init__(self):
        super().__init__()

    # Later there can be additional arguments (e.g., length excluding padding)
    def init_state(self, enc_all_outputs, *args):
        raise NotImplementedError

    def forward(self, X, state):
        raise NotImplementedError


10.6.3 Putting the Encoder and Decoder Together 


In the forward propagation, the output of the encoder is used to produce the encoded state, and this state will be further used by the decoder as one of its inputs.


class EncoderDecoder(d2l.Classifier):  #@save
    """The base class for the encoder--decoder architecture."""
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, enc_X, dec_X, *args):
        enc_all_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_all_outputs, *args)
        # Return decoder output only
        return self.decoder(dec_X, dec_state)[0]


In the next section, we will see how to apply RNNs to design sequence-to-sequence models based on this encoder-decoder architecture.


10.6.4 Summary


Encoder-decoder architectures can handle inputs and outputs that both consist of variable- 
length sequences and thus are suitable for sequence-to-sequence problems such as machine 
translation. The encoder takes a variable-length sequence as input and transforms it into a 
state with a fixed shape. The decoder maps the encoded state of a fixed shape to a variable- 
length sequence. 


10.6.5 Exercises 


1. Suppose that we use neural networks to implement the encoder—decoder architecture. 
Do the encoder and the decoder have to be the same type of neural network? 


2. Besides machine translation, can you think of another application where the encoder- 
decoder architecture can be applied? 


Discussions 151.


10.7 Sequence-to-Sequence Learning for Machine 
Translation 


In so-called sequence-to-sequence problems such as machine translation (as discussed in 
Section 10.5), where inputs and outputs each consist of variable-length unaligned sequences, 
we generally rely on encoder—decoder architectures (Section 10.6). In this section, we will 
demonstrate the application of an encoder—decoder architecture, where both the encoder 
and decoder are implemented as RNNs, to the task of machine translation (Cho et al., 
2014, Sutskever et al., 2014). 


Here, the encoder RNN will take a variable-length sequence as input and transform it into 
a fixed-shape hidden state. Later, in Chapter 11, we will introduce attention mechanisms, 
which allow us to access encoded inputs without having to compress the entire input into a 
single fixed-length representation. 


Then to generate the output sequence, one token at a time, the decoder model, consisting 
of a separate RNN, will predict each successive target token given both the input sequence 
and the preceding tokens in the output. During training, the decoder will typically be con- 
ditioned upon the preceding tokens in the official “ground truth” label. However, at test 
time, we will want to condition each output of the decoder on the tokens already predicted. 
Note that if we ignore the encoder, the decoder in a sequence-to-sequence architecture be- 
haves just like a normal language model. Fig. 10.7.1 illustrates how to use two RNNs for 
sequence-to-sequence learning in machine translation. 


In Fig. 10.7.1, the special “<eos>” token marks the end of the sequence. Our model can stop making predictions once this token is generated. At the initial time step of the RNN decoder, there are two special design decisions to be aware of: First, we begin every input with a special beginning-of-sequence “<bos>” token. Second, we may feed the final hidden state of the encoder into the decoder at every single decoding time step (Cho et al., 2014). In some other designs, such as that of Sutskever et al. (2014), the final hidden state of the RNN encoder is used to initialize the hidden state of the decoder only at the first decoding step.

Fig. 10.7.1: Sequence-to-sequence learning with an RNN encoder and an RNN decoder.


import collections
import math
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


10.7.1 Teacher Forcing 


While running the encoder on the input sequence is relatively straightforward, handling the input and output of the decoder requires more care. The most common approach is sometimes called teacher forcing. Here, the original target sequence (token labels) is fed into the decoder as input. More concretely, the special beginning-of-sequence token and the original target sequence, excluding the final token, are concatenated as input to the decoder, while the decoder output (labels for training) is the original target sequence, shifted by one token: “<bos>”, “Ils”, “regardent”, “.” → “Ils”, “regardent”, “.”, “<eos>” (Fig. 10.7.1).


Our implementation in Section 10.5.3 prepared training data for teacher forcing, where 
shifting tokens for self-supervised learning is similar to the training of language models in 
Section 9.3. An alternative approach is to feed the predicted token from the previous time 
step as the current input to the decoder. 
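
The one-token shift used in teacher forcing can be made concrete with a small example. The following is a minimal sketch in plain Python operating on token lists; the helper name teacher_forcing_pair is hypothetical and padding is ignored for simplicity.

```python
def teacher_forcing_pair(target_tokens, bos='<bos>', eos='<eos>'):
    """Return (decoder_input, labels) for one target sequence."""
    full = [bos] + target_tokens + [eos]
    # The decoder sees everything but the last token ...
    decoder_input = full[:-1]
    # ... and is trained to predict everything but the first
    labels = full[1:]
    return decoder_input, labels

dec_input, labels = teacher_forcing_pair(['Ils', 'regardent', '.'])
print(dec_input)
print(labels)
```

At every position, the label is simply the decoder input that follows one step later, so the ground-truth prefix is always available during training regardless of what the model would have predicted.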


In the following, we explain the design depicted in Fig. 10.7.1 in greater detail. We will 
train this model for machine translation on the English-French dataset as introduced in 
Section 10.5. 


10.7.2 Encoder 


Recall that the encoder transforms an input sequence of variable length into a fixed-shape context variable $\mathbf{c}$ (see Fig. 10.7.1).

Consider a single sequence example (batch size 1). Suppose the input sequence is $x_1, \ldots, x_T$, such that $x_t$ is the $t^\mathrm{th}$ token. At time step $t$, the RNN transforms the input feature vector $\mathbf{x}_t$ for $x_t$ and the hidden state $\mathbf{h}_{t-1}$ from the previous time step into the current hidden state $\mathbf{h}_t$. We can use a function $f$ to express the transformation of the RNN’s recurrent layer:

$$\mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1}). \tag{10.7.1}$$

In general, the encoder transforms the hidden states at all time steps into a context variable through a customized function $q$:

$$\mathbf{c} = q(\mathbf{h}_1, \ldots, \mathbf{h}_T). \tag{10.7.2}$$

For example, in Fig. 10.7.1, the context variable is just the hidden state $\mathbf{h}_T$ corresponding to the encoder RNN’s representation after processing the final token of the input sequence.


In this example, we have used a unidirectional RNN to design the encoder, where the hidden 
state only depends on the input subsequence at and before the time step of the hidden state. 
We can also construct encoders using bidirectional RNNs. In this case, a hidden state 
depends on the subsequence before and after the time step (including the input at the current 
time step), which encodes the information of the entire sequence. 


Now let’s implement the RNN encoder. Note that we use an embedding layer to obtain 
the feature vector for each token in the input sequence. The weight of an embedding 
layer is a matrix, where the number of rows corresponds to the size of the input vocab- 
ulary (vocab_size) and number of columns corresponds to the feature vector’s dimension 
(embed_size). For any input token index $i$, the embedding layer fetches the $i^\mathrm{th}$ row (starting
from 0) of the weight matrix to return its feature vector. Here we implement the encoder 
with a multilayer GRU. 
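
The row-lookup behavior of the embedding layer can be verified directly. The following is a minimal sketch assuming only PyTorch's nn.Embedding; the sizes and token indices are arbitrary illustrative values.

```python
import torch
from torch import nn

vocab_size, embed_size = 10, 8
embedding = nn.Embedding(vocab_size, embed_size)

# A minibatch of token indices with shape (batch_size, num_steps)
tokens = torch.tensor([[1, 3], [3, 0]])
vectors = embedding(tokens)  # shape (batch_size, num_steps, embed_size)

# The feature vector of token index 3 is exactly row 3 of the weight matrix
same_row = torch.equal(vectors[0, 1], embedding.weight[3])
print(vectors.shape, same_row)
```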


def init_seq2seq(module):  #@save
    """Initialize weights for sequence-to-sequence learning."""
    if type(module) == nn.Linear:
        nn.init.xavier_uniform_(module.weight)
    if type(module) == nn.GRU:
        for param in module._flat_weights_names:
            if "weight" in param:
                nn.init.xavier_uniform_(module._parameters[param])

class Seq2SeqEncoder(d2l.Encoder):  #@save
    """The RNN encoder for sequence-to-sequence learning."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = d2l.GRU(embed_size, num_hiddens, num_layers, dropout)
        self.apply(init_seq2seq)

    def forward(self, X, *args):
        # X shape: (batch_size, num_steps)
        embs = self.embedding(X.t().type(torch.int64))
        # embs shape: (num_steps, batch_size, embed_size)
        outputs, state = self.rnn(embs)
        # outputs shape: (num_steps, batch_size, num_hiddens)
        # state shape: (num_layers, batch_size, num_hiddens)
        return outputs, state


Let’s use a concrete example to illustrate the above encoder implementation. Below, we 
instantiate a two-layer GRU encoder whose number of hidden units is 16. Given a minibatch 
of sequence inputs X (batch size = 4; number of time steps = 9), the hidden states of the 
final layer at all the time steps (enc_outputs returned by the encoder’s recurrent layers) 
are a tensor of shape (number of time steps, batch size, number of hidden units). 


vocab_size, embed_size, num_hiddens, num_layers = 10, 8, 16, 2
batch_size, num_steps = 4, 9
encoder = Seq2SeqEncoder(vocab_size, embed_size, num_hiddens, num_layers)
X = torch.zeros((batch_size, num_steps))
enc_outputs, enc_state = encoder(X)
d2l.check_shape(enc_outputs, (num_steps, batch_size, num_hiddens))


Since we are using a GRU here, the shape of the multilayer hidden states at the final time 
step is (number of hidden layers, batch size, number of hidden units). 


d2l.check_shape(enc_state, (num_layers, batch_size, num_hiddens))


10.7.3 Decoder 


Given a target output sequence y1, y2,..., yz for each time step t’ (we use f’ to differentiate 
from the input sequence time steps), the decoder assigns a predicted probability to each 
possible token occurring at step y,4, conditioned upon the previous tokens in the target 
y1,---, yy and the context variable c, i.e., P(yy41 | y1,---5 Yr, ©). 


To predict the subsequent token $t' + 1$ in the target sequence, the RNN decoder takes the
previous step's target token $y_{t'}$, the hidden RNN state from the previous time step $\mathbf{s}_{t'-1}$,
and the context variable $\mathbf{c}$ as its input, and transforms them into the hidden state $\mathbf{s}_{t'}$ at the
current time step. We can use a function $g$ to express the transformation of the decoder's
hidden layer:


$$\mathbf{s}_{t'} = g(y_{t'-1}, \mathbf{c}, \mathbf{s}_{t'-1}). \tag{10.7.3}$$


After obtaining the hidden state of the decoder, we can use an output layer and the softmax
operation to compute the predictive distribution $P(y_{t'+1} \mid y_1, \ldots, y_{t'}, \mathbf{c})$ over the
subsequent output token $t' + 1$.


Following Fig. 10.7.1, when implementing the decoder as follows, we directly use the hid- 
den state at the final time step of the encoder to initialize the hidden state of the decoder. 
This requires that the RNN encoder and the RNN decoder have the same number of lay- 
ers and hidden units. To further incorporate the encoded input sequence information, the 
context variable is concatenated with the decoder input at all the time steps. To predict the 
probability distribution of the output token, we use a fully connected layer to transform the
hidden state at the final layer of the RNN decoder. 


class Seq2SeqDecoder(d2l.Decoder):
    """The RNN decoder for sequence to sequence learning."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = d2l.GRU(embed_size+num_hiddens, num_hiddens,
                           num_layers, dropout)
        self.dense = nn.LazyLinear(vocab_size)
        self.apply(init_seq2seq)

    def init_state(self, enc_all_outputs, *args):
        return enc_all_outputs

    def forward(self, X, state):
        # X shape: (batch_size, num_steps)
        # embs shape: (num_steps, batch_size, embed_size)
        embs = self.embedding(X.t().type(torch.int32))
        enc_output, hidden_state = state
        # context shape: (batch_size, num_hiddens)
        context = enc_output[-1]
        # Broadcast context to (num_steps, batch_size, num_hiddens)
        context = context.repeat(embs.shape[0], 1, 1)
        # Concat at the feature dimension
        embs_and_context = torch.cat((embs, context), -1)
        outputs, hidden_state = self.rnn(embs_and_context, hidden_state)
        outputs = self.dense(outputs).swapaxes(0, 1)
        # outputs shape: (batch_size, num_steps, vocab_size)
        # hidden_state shape: (num_layers, batch_size, num_hiddens)
        return outputs, [enc_output, hidden_state]


To illustrate the implemented decoder, below we instantiate it with the same hyperparam- 
eters from the aforementioned encoder. As we can see, the output shape of the decoder 
becomes (batch size, number of time steps, vocabulary size), where the final dimension of 
the tensor stores the predicted token distribution. 


decoder = Seq2SeqDecoder(vocab_size, embed_size, num_hiddens, num_layers)
state = decoder.init_state(encoder(X))
dec_outputs, state = decoder(X, state)
d2l.check_shape(dec_outputs, (batch_size, num_steps, vocab_size))
d2l.check_shape(state[1], (num_layers, batch_size, num_hiddens))


The layers in the above RNN encoder-decoder model are summarized in Fig. 10.7.2.


Fig. 10.7.2: Layers in an RNN encoder-decoder model.


10.7.4 Encoder-Decoder for Sequence-to-Sequence Learning


Putting it all together in code yields the following:


class Seq2Seq(d2l.EncoderDecoder):  #@save
    """The RNN encoder--decoder for sequence to sequence learning."""
    def __init__(self, encoder, decoder, tgt_pad, lr):
        super().__init__(encoder, decoder)
        self.save_hyperparameters()

    def validation_step(self, batch):
        Y_hat = self(*batch[:-1])
        self.plot('loss', self.loss(Y_hat, batch[-1]), train=False)

    def configure_optimizers(self):
        # Adam optimizer is used here
        return torch.optim.Adam(self.parameters(), lr=self.lr)


10.7.5 Loss Function with Masking 


At each time step, the decoder predicts a probability distribution for the output tokens. 
As with language modeling, we can apply softmax to obtain the distribution and calculate 
the cross-entropy loss for optimization. Recall from Section 10.5 that the special padding 
tokens are appended to the end of sequences and so sequences of varying lengths can be 
efficiently loaded in minibatches of the same shape. However, prediction of padding tokens 
should be excluded from loss calculations. To this end, we can mask irrelevant entries 
with zero values so that multiplication of any irrelevant prediction with zero equates to 
zero. 


@d2l.add_to_class(Seq2Seq)
def loss(self, Y_hat, Y):
    l = super(Seq2Seq, self).loss(Y_hat, Y, averaged=False)
    mask = (Y.reshape(-1) != self.tgt_pad).type(torch.float32)
    return (l * mask).sum() / mask.sum()
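

The masking arithmetic can also be traced by hand. The following is a minimal sketch in plain Python rather than PyTorch; the per-token loss values, the token ids, and the padding id `pad_id = 0` are made-up numbers for illustration, and `masked_mean` is a hypothetical helper that is not part of the d2l library.

```python
def masked_mean(losses, targets, pad_id):
    """Average per-token losses, ignoring positions whose target is padding."""
    # mask[i] is 1.0 for a real token and 0.0 for a padding token
    mask = [1.0 if t != pad_id else 0.0 for t in targets]
    # Multiplying by the mask zeroes out the loss at padded positions
    total = sum(l * m for l, m in zip(losses, mask))
    # Divide by the number of real tokens, not by the padded length
    return total / sum(mask)


# Hypothetical per-token cross-entropy values for one padded sequence
losses = [2.0, 1.0, 3.0, 5.0, 5.0]
targets = [7, 4, 9, 0, 0]  # the last two positions are padding
print(masked_mean(losses, targets, pad_id=0))
```

Only the three real tokens contribute, so the large loss values at the padded positions have no effect on the result. The sketch assumes that at least one position is not padding; otherwise the denominator would be zero.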


10.7.6 Training 


Now we can create and train an RNN encoder—decoder model for sequence-to-sequence 
learning on the machine translation dataset. 


data = d2l.MTFraEng(batch_size=128)
embed_size, num_hiddens, num_layers, dropout = 256, 256, 2, 0.2
encoder = Seq2SeqEncoder(
    len(data.src_vocab), embed_size, num_hiddens, num_layers, dropout)
decoder = Seq2SeqDecoder(
    len(data.tgt_vocab), embed_size, num_hiddens, num_layers, dropout)
model = Seq2Seq(encoder, decoder, tgt_pad=data.tgt_vocab['<pad>'],
                lr=0.005)
trainer = d2l.Trainer(max_epochs=30, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)


[Figure: training and validation loss curves (legend: train_loss, val_loss).]


10.7.7 Prediction 


To predict the output sequence at each step, the predicted token from the previous time step
is fed into the decoder as an input. One simple strategy is to select, at each step, whichever
token the decoder assigns the highest probability. As in training, at the initial time step
the beginning-of-sequence (“<bos>”) token is fed into the decoder. This prediction process
is illustrated in Fig. 10.7.3. When the end-of-sequence (“<eos>”) token is predicted, the
prediction of the output sequence is complete.


Fig. 10.7.3: Predicting the output sequence token by token using an RNN encoder-decoder.


In the next section, we will introduce more sophisticated strategies based on beam search 
(Section 10.8). 


@d2l.add_to_class(d2l.EncoderDecoder)  #@save
def predict_step(self, batch, device, num_steps,
                 save_attention_weights=False):
    batch = [a.to(device) for a in batch]
    src, tgt, src_valid_len, _ = batch
    enc_all_outputs = self.encoder(src, src_valid_len)
    dec_state = self.decoder.init_state(enc_all_outputs, src_valid_len)
    outputs, attention_weights = [tgt[:, (0)].unsqueeze(1), ], []
    for _ in range(num_steps):
        Y, dec_state = self.decoder(outputs[-1], dec_state)
        outputs.append(Y.argmax(2))
        # Save attention weights (to be covered later)
        if save_attention_weights:
            attention_weights.append(self.decoder.attention_weights)
    return torch.cat(outputs[1:], 1), attention_weights


10.7.8 Evaluation of Predicted Sequences 


We can evaluate a predicted sequence by comparing it with the target sequence (the ground 
truth). But what precisely is the appropriate measure for comparing similarity between two 
sequences? 


Bilingual Evaluation Understudy (BLEU), though originally proposed for evaluating ma- 
chine translation results (Papineni et al., 2002), has been extensively used in measuring the 
quality of output sequences for different applications. In principle, for any n-gram (Section 
9.3.1) in the predicted sequence, BLEU evaluates whether this n-gram appears in the target 
sequence. 


Denote by $p_n$ the precision of an $n$-gram, defined as the ratio of the number of matched
$n$-grams in the predicted and target sequences to the number of $n$-grams in the predicted
sequence. To explain, given a target sequence $A$, $B$, $C$, $D$, $E$, $F$, and a predicted sequence
$A$, $B$, $B$, $C$, $D$, we have $p_1 = 4/5$, $p_2 = 3/4$, $p_3 = 1/3$, and $p_4 = 0$. Now let $\mathrm{len}_{\textrm{label}}$
and $\mathrm{len}_{\textrm{pred}}$ be the numbers of tokens in the target sequence and the predicted sequence,
respectively. Then, BLEU is defined as


$$\exp\left(\min\left(0, 1 - \frac{\mathrm{len}_{\textrm{label}}}{\mathrm{len}_{\textrm{pred}}}\right)\right) \prod_{n=1}^k p_n^{1/2^n}, \tag{10.7.4}$$


where k is the longest n-gram for matching. 
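

The precision values quoted for the predicted sequence A, B, B, C, D can be checked with a few lines of plain Python. This is only a sketch of the counting rule: `ngram_precision` is a hypothetical helper, and it assumes that each n-gram in the target may be matched at most once, which is what yields $p_1 = 4/5$ rather than $5/5$.

```python
import collections


def ngram_precision(pred_tokens, label_tokens, n):
    """Return (matched n-grams, n-grams in the prediction) for a given n."""
    # Count how often each n-gram occurs in the target sequence
    label_counts = collections.Counter(
        tuple(label_tokens[i:i + n])
        for i in range(len(label_tokens) - n + 1))
    num_matches = 0
    for i in range(len(pred_tokens) - n + 1):
        gram = tuple(pred_tokens[i:i + n])
        if label_counts[gram] > 0:
            num_matches += 1
            label_counts[gram] -= 1  # a target n-gram is used up once matched
    return num_matches, len(pred_tokens) - n + 1


label = ['A', 'B', 'C', 'D', 'E', 'F']
pred = ['A', 'B', 'B', 'C', 'D']
for n in range(1, 5):
    matches, total = ngram_precision(pred, label, n)
    print(f'p_{n} = {matches}/{total}')
```

The second B in the prediction finds no unmatched B left in the target, so it does not count towards $p_1$.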


Based on the definition of BLEU in (10.7.4), whenever the predicted sequence is the same
as the target sequence, BLEU is 1. Moreover, since matching longer $n$-grams is more difficult,
BLEU assigns a greater weight when a longer $n$-gram has high precision. Specifically,
when $p_n$ is fixed, $p_n^{1/2^n}$ increases as $n$ grows (the original paper uses $p_n^{1/n}$). Furthermore,
since predicting shorter sequences tends to yield a higher $p_n$ value, the coefficient before
the multiplication term in (10.7.4) penalizes shorter predicted sequences. For example,
when $k = 2$, given the target sequence $A$, $B$, $C$, $D$, $E$, $F$ and the predicted sequence $A$, $B$,
although $p_1 = p_2 = 1$, the penalty factor $\exp(1 - 6/2) \approx 0.14$ lowers the BLEU.


We implement the BLEU measure as follows. 


def bleu(pred_seq, label_seq, k):  #@save
    """Compute the BLEU."""
    pred_tokens, label_tokens = pred_seq.split(' '), label_seq.split(' ')
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, min(k, len_pred) + 1):
        num_matches, label_subs = 0, collections.defaultdict(int)
        for i in range(len_label - n + 1):
            label_subs[' '.join(label_tokens[i: i + n])] += 1
        for i in range(len_pred - n + 1):
            if label_subs[' '.join(pred_tokens[i: i + n])] > 0:
                num_matches += 1
                label_subs[' '.join(pred_tokens[i: i + n])] -= 1
        score *= math.pow(num_matches / (len_pred - n + 1), math.pow(0.5, n))
    return score


In the end, we use the trained RNN encoder-decoder to translate a few English sentences
into French and compute the BLEU of the results.


engs = ['go .', 'i lost .', 'he\'s calm .', 'i\'m home .']
fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .']
preds, _ = model.predict_step(
    data.build(engs, fras), d2l.try_gpu(), data.num_steps)
for en, fr, p in zip(engs, fras, preds):
    translation = []
    for token in data.tgt_vocab.to_tokens(p):
        if token == '<eos>':
            break
        translation.append(token)
    print(f'{en} => {translation}, bleu,'
          f'{bleu(" ".join(translation), fr, k=2):.3f}')


go . => ['va', '!'], bleu,1.000
i lost . => ["j'ai", 'perdu', '.'], bleu,1.000
he's calm . => ['elle', 'court', '.'], bleu,0.000
i'm home . => ['je', 'suis', 'chez', 'moi', '.'], bleu,1.000


10.7.9 Summary 


Following the design of the encoder—decoder architecture, we can use two RNNs to design 
a model for sequence-to-sequence learning. In encoder—decoder training, the teacher forc- 
ing approach feeds original output sequences (in contrast to predictions) into the decoder. 
When implementing the encoder and the decoder, we can use multilayer RNNs. We can 
use masks to filter out irrelevant computations, such as when calculating the loss. For eval- 
uating output sequences, BLEU is a popular measure that matches n-grams between the 
predicted sequence and the target sequence. 


10.7.10 Exercises 
1. Can you adjust the hyperparameters to improve the translation results? 


2. Rerun the experiment without using masks in the loss calculation. What results do you 
observe? Why? 
3. If the encoder and the decoder differ in the number of layers or the number of hidden
units, how can we initialize the hidden state of the decoder? 


4. In training, replace teacher forcing with feeding the prediction at the previous time step 
into the decoder. How does this influence the performance? 


5. Rerun the experiment by replacing GRU with LSTM. 


6. Are there any other ways to design the output layer of the decoder? 




10.8 Beam Search 


In Section 10.7, we introduced the encoder—decoder architecture, and the standard tech- 
niques for training them end-to-end. However, when it came to test-time prediction, we 
mentioned only the greedy strategy, where we select at each time step the token given the 
highest predicted probability of coming next, until, at some time step, we find that we have 
predicted the special end-of-sequence “<eos>” token. In this section, we will begin by for- 
malizing this greedy search strategy and identifying some problems that practitioners tend 
to run into. Subsequently, we compare this strategy with two alternatives: exhaustive search 
(illustrative but not practical) and beam search (the standard method in practice). 


Let’s begin by setting up our mathematical notation, borrowing conventions from Section
10.7. At any time step $t'$, the decoder outputs predictions representing the probability of
each token in the vocabulary coming next in the sequence (the likely value of $y_{t'+1}$),
conditioned on the previous tokens $y_1, \ldots, y_{t'}$ and the context variable $\mathbf{c}$, produced by the
encoder to represent the input sequence. To quantify computational cost, denote by $\mathcal{Y}$ the
output vocabulary (including the special end-of-sequence token “<eos>”). Let’s also specify
the maximum number of tokens of an output sequence as $T'$. Our goal is to search for
an ideal output from all $\mathcal{O}(|\mathcal{Y}|^{T'})$ possible output sequences. Note that this slightly
overestimates the number of distinct outputs because there are no subsequent tokens once the
“<eos>” token occurs. However, for our purposes, this number roughly captures the size
of the search space.


10.8.1 Greedy Search 


Consider the simple greedy search strategy from Section 10.7. Here, at any time step $t'$,
we simply select the token with the highest conditional probability from $\mathcal{Y}$, i.e.,


$$y_{t'} = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c}). \tag{10.8.1}$$


Once our model outputs “<eos>” (or we reach the maximum length $T'$) the output sequence
is completed.


This strategy might look reasonable, and in fact it is not so bad! Considering how
computationally undemanding it is, you'd be hard pressed to get more bang for your buck. However,
if we put aside efficiency for a minute, it might seem more reasonable to search for the most
likely sequence, not the sequence of (greedily selected) most likely tokens. It turns out that
these two objects can be quite different. The most likely sequence is the one that maximizes
the expression $\prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$. In our machine translation example, if the
decoder truly recovered the probabilities of the underlying generative process, then this
would give us the most likely translation. Unfortunately, there is no guarantee that greedy
search will give us this sequence.


Let’s illustrate it with an example. Suppose that there are four tokens “A”, “B”, “C”, and 
“<eos>” in the output dictionary. In Fig. 10.8.1, the four numbers under each time step rep- 
resent the conditional probabilities of generating “A”, “B”, “C”, and “<eos>” respectively, 
at that time step. 


Fig. 10.8.1: At each time step, greedy search selects the token with the highest conditional probability.


At each time step, greedy search selects the token with the highest conditional probability.
Therefore, the output sequence “A”, “B”, “C”, and “<eos>” will be predicted (Fig. 10.8.1).
The conditional probability of this output sequence is $0.5 \times 0.4 \times 0.4 \times 0.6 = 0.048$.


Next, let’s look at another example in Fig. 10.8.2. Unlike in Fig. 10.8.1, at time step 2 we 
select the token “C”, which has the second highest conditional probability. 


Fig. 10.8.2: The four numbers under each time step represent the conditional probabilities of
generating “A”, “B”, “C”, and “<eos>” at that time step. At time step 2, the token “C”,
which has the second highest conditional probability, is selected.


Since the output subsequences at time steps 1 and 2, on which time step 3 is based, have
changed from “A” and “B” in Fig. 10.8.1 to “A” and “C” in Fig. 10.8.2, the conditional
probability of each token at time step 3 has also changed in Fig. 10.8.2. Suppose that
we choose the token “B” at time step 3. Now time step 4 is conditional on the output
subsequence at the first three time steps “A”, “C”, and “B”, which has changed from “A”,
“B”, and “C” in Fig. 10.8.1. Therefore, the conditional probability of generating each token
at time step 4 in Fig. 10.8.2 is also different from that in Fig. 10.8.1. As a result, the
conditional probability of the output sequence “A”, “C”, “B”, and “<eos>” in Fig. 10.8.2
is $0.5 \times 0.3 \times 0.6 \times 0.6 = 0.054$, which is greater than that of greedy search in Fig. 10.8.1.
In this example, the output sequence “A”, “B”, “C”, and “<eos>” obtained by the greedy
search is not optimal.
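

The comparison between the two decoding paths comes down to multiplying the conditional probabilities of the chosen tokens. The short sketch below reproduces that arithmetic in plain Python, using only the per-step probabilities quoted in the text; the variable names are illustrative.

```python
import math

# Conditional probability of the chosen token at each of the four time steps
greedy_path = [0.5, 0.4, 0.4, 0.6]       # "A", "B", "C", "<eos>"
alternative_path = [0.5, 0.3, 0.6, 0.6]  # "A", "C", "B", "<eos>"

# The probability of a whole sequence is the product of its per-step terms
p_greedy = math.prod(greedy_path)
p_alternative = math.prod(alternative_path)
print(f'greedy: {p_greedy:.3f}, alternative: {p_alternative:.3f}')
```

Choosing the second-best token at time step 2 costs a factor of 0.3/0.4 there but gains a factor of 0.6/0.4 at time step 3, which is why the overall product ends up larger.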


10.8.2 Exhaustive Search 


If the goal is to obtain the most likely sequence, we may consider using exhaustive search: 
enumerate all the possible output sequences with their conditional probabilities, and then 
output the one that scores the highest predicted probability. 


While this would certainly give us what we desire, it would come at a prohibitive
computational cost of $\mathcal{O}(|\mathcal{Y}|^{T'})$, exponential in the sequence length and with an enormous
base given by the vocabulary size. For example, when $|\mathcal{Y}| = 10000$ and $T' = 10$, both
small numbers when compared with ones in real applications, we will need to evaluate
$10000^{10} = 10^{40}$ sequences, which is already beyond the capabilities of any foreseeable
computers. On the other hand, the computational cost of greedy search is $\mathcal{O}(|\mathcal{Y}| T')$:
miraculously cheap but far from optimal. For example, when $|\mathcal{Y}| = 10000$ and $T' = 10$, we only
need to evaluate $10000 \times 10 = 10^5$ sequences.
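

The two counts are easy to confirm with exact integer arithmetic. The snippet below is a small check of the numbers in the example, with illustrative variable names.

```python
vocab_size, max_len = 10000, 10

# Exhaustive search: every one of |Y|^T' token sequences is a candidate
exhaustive = vocab_size ** max_len
# Greedy search: |Y| candidate tokens are scored at each of the T' steps
greedy = vocab_size * max_len

print(exhaustive == 10 ** 40, greedy == 10 ** 5)
print(f'ratio: 10^{len(str(exhaustive // greedy)) - 1}')
```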


10.8.3 Beam Search 


You could view sequence decoding strategies as lying on a spectrum, with beam search 
striking a compromise between the efficiency of greedy search and the optimality of ex- 
haustive search. The most straightforward version of beam search is characterized by a 
single hyperparameter, the beam size, k. Let’s explain this terminology. At time step 1, 
we select the k tokens with the highest predicted probabilities. Each of them will be the 
first token of k candidate output sequences, respectively. At each subsequent time step, 
based on the k candidate output sequences at the previous time step, we continue to select 
$k$ candidate output sequences with the highest predicted probabilities from $k|\mathcal{Y}|$ possible
choices.


Fig. 10.8.3: The process of beam search (beam size = 2; maximum length of an output sequence = 3).
The candidate output sequences are A, C, AB, CE, ABD, and CED.


Fig. 10.8.3 demonstrates the process of beam search with an example. Suppose that the
output vocabulary contains only five elements: $\mathcal{Y} = \{A, B, C, D, E\}$, where one of them is
“<eos>”. Let the beam size be two and the maximum length of an output sequence be three.
At time step 1, suppose that the tokens with the highest conditional probabilities $P(y_1 \mid \mathbf{c})$
are $A$ and $C$. At time step 2, for all $y_2 \in \mathcal{Y}$, we compute


$$\begin{aligned} P(A, y_2 \mid \mathbf{c}) &= P(A \mid \mathbf{c}) P(y_2 \mid A, \mathbf{c}), \\ P(C, y_2 \mid \mathbf{c}) &= P(C \mid \mathbf{c}) P(y_2 \mid C, \mathbf{c}), \end{aligned} \tag{10.8.2}$$


and pick the largest two among these ten values, say $P(A, B \mid \mathbf{c})$ and $P(C, E \mid \mathbf{c})$. Then at
time step 3, for all $y_3 \in \mathcal{Y}$, we compute


$$\begin{aligned} P(A, B, y_3 \mid \mathbf{c}) &= P(A, B \mid \mathbf{c}) P(y_3 \mid A, B, \mathbf{c}), \\ P(C, E, y_3 \mid \mathbf{c}) &= P(C, E \mid \mathbf{c}) P(y_3 \mid C, E, \mathbf{c}), \end{aligned} \tag{10.8.3}$$


and pick the largest two among these ten values, say $P(A, B, D \mid \mathbf{c})$ and $P(C, E, D \mid \mathbf{c})$.
As a result, we get six candidate output sequences: (i) $A$; (ii) $C$; (iii) $A$, $B$; (iv) $C$, $E$; (v)
$A$, $B$, $D$; and (vi) $C$, $E$, $D$.


In the end, we obtain the set of final candidate output sequences based on these six
sequences (e.g., discard portions including and after “<eos>”). Then we choose the output
sequence which maximizes the following score:


$$\frac{1}{L^\alpha} \log P(y_1, \ldots, y_L \mid \mathbf{c}) = \frac{1}{L^\alpha} \sum_{t'=1}^L \log P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}); \tag{10.8.4}$$


here $L$ is the length of the final candidate sequence and $\alpha$ is usually set to 0.75. Since a
longer sequence has more logarithmic terms in the summation of (10.8.4), the term $L^\alpha$ in
the denominator penalizes long sequences.


The computational cost of beam search is $\mathcal{O}(k |\mathcal{Y}| T')$. This result is in between that of
greedy search and that of exhaustive search. Greedy search can be treated as a special case
of beam search arising when the beam size is set to 1.
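

The procedure can be sketched in plain Python as follows. This is a simplified variant under stated assumptions rather than a definitive implementation: only sequences that end in “<eos>” or reach the maximum length are scored at the end (the description above also keeps shorter prefixes as candidates), the length $L$ counts the “<eos>” token, and `toy_model` is a hypothetical next-token distribution with made-up probabilities that depend only on the last token.

```python
import math


def beam_search(next_token_probs, beam_size, max_len, alpha=0.75, eos='<eos>'):
    """Return (sequence, log-probability) with the best length-normalized score."""
    beams = [((), 0.0)]  # partial sequences that are still being extended
    finished = []        # sequences that ended in <eos> or reached max_len
    for step in range(max_len):
        # Extend every partial sequence by every possible next token
        expanded = [
            (seq + (token,), logp + math.log(p))
            for seq, logp in beams
            for token, p in next_token_probs(seq).items() if p > 0]
        # Keep the beam_size most probable of the (at most) k * |Y| extensions
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for seq, logp in expanded[:beam_size]:
            if seq[-1] == eos or step == max_len - 1:
                finished.append((seq, logp))
            else:
                beams.append((seq, logp))
        if not beams:
            break
    # Score each finished sequence by log P / L^alpha, as in (10.8.4)
    return max(finished, key=lambda item: item[1] / len(item[0]) ** alpha)


def toy_model(prefix):
    """Hypothetical conditional distribution over the next token."""
    table = {
        None: {'A': 0.6, 'B': 0.4},
        'A': {'A': 0.4, 'B': 0.35, '<eos>': 0.25},
        'B': {'A': 0.05, 'B': 0.05, '<eos>': 0.9},
    }
    return table[prefix[-1] if prefix else None]


for k in (1, 2):
    seq, logp = beam_search(toy_model, beam_size=k, max_len=3)
    print(f'beam size {k}: {seq}, probability {math.exp(logp):.3f}')
```

With a beam size of 1 the sketch reduces to greedy search and commits to “A” at the first step. With a beam size of 2 it also keeps “B” alive, which lets it find a sequence with a higher overall probability in this toy example.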


10.8.4 Summary 


Sequence searching strategies include greedy search, exhaustive search, and beam search. 
Beam search provides a trade-off between accuracy and computational cost via the flexible 
choice of the beam size. 


10.8.5 Exercises 
1. Can we treat exhaustive search as a special type of beam search? Why or why not? 


2. Apply beam search in the machine translation problem in Section 10.7. How does the 
beam size affect the translation results and the prediction speed? 


3. We used language modeling for generating text following user-provided prefixes in Sec- 
tion 9.5. Which kind of search strategy does it use? Can you improve it? 
The earliest years of the deep learning boom were driven primarily by results produced using
the multilayer perceptron, convolutional network, and recurrent network architectures.
Remarkably, the model architectures that underpinned many of deep learning’s break- 
throughs in the 2010s had changed remarkably little relative to their antecedents despite the 
lapse of nearly 30 years. While plenty of new methodological innovations made their way 
into most practitioner’s toolkits—ReLU activations, residual layers, batch normalization, 
dropout, and adaptive learning rate schedules come to mind—the core underlying archi- 
tectures were clearly recognizable as scaled-up implementations of classic ideas. Despite 
thousands of papers proposing alternative ideas, models resembling classical convolutional 
neural networks (Chapter 7) retained state-of-the-art status in computer vision and models 
resembling Sepp Hochreiter’s original design for the LSTM recurrent neural network (Sec- 
tion 10.1), dominated most applications in natural language processing. Arguably, to that 
point, the rapid emergence of deep learning appeared to be primarily attributable to shifts 
in the available computational resources (thanks to innovations in parallel computing with 
GPUs) and the availability of massive data resources (thanks to cheap storage and Internet 
services). While these factors may indeed remain the primary drivers behind this technology’s
increasing power, we are also witnessing, at long last, a sea change in the landscape
of dominant architectures. 


At the present moment, the dominant models for nearly all natural language processing 
tasks are based on the Transformer architecture. Given any new task in natural language 
processing, the default first-pass approach is to grab a large Transformer-based pretrained
model (e.g., BERT (Devlin et al., 2018), ELECTRA (Clark et al., 2020), RoBERTa (Liu
et al., 2019), or Longformer (Beltagy et al., 2020)), adapt the output layers as necessary,
and fine-tune the model on the available data for the downstream task. If you have
been paying attention to the last few years of breathless news coverage centered on Ope- 
nAI’s large language models, then you have been tracking a conversation centered on the 
GPT-2 and GPT-3 Transformer-based models (Brown et al., 2020, Radford et al., 2019). 
Meanwhile, the vision Transformer has emerged as a default model for diverse vision tasks, 
including image recognition, object detection, semantic segmentation, and superresolution 
(Dosovitskiy et al., 2021, Liu et al., 2021). Transformers also showed up as competitive 
methods for speech recognition (Gulati et al., 2020), reinforcement learning (Chen et al., 
2021), and graph neural networks (Dwivedi and Bresson, 2020). 


The core idea behind the Transformer model is the attention mechanism, an innovation 
that was originally envisioned as an enhancement for encoder—-decoder RNNs applied to 
sequence-to-sequence applications, such as machine translation (Bahdanau et al., 2014).
You might recall that in the first sequence-to-sequence models for machine translation 
(Sutskever et al., 2014), the entire input was compressed by the encoder into a single fixed- 
length vector to be fed into the decoder. The intuition behind attention is that rather than 
compressing the input, it might be better for the decoder to revisit the input sequence at 
every step. Moreover, rather than always seeing the same representation of the input, one 
might imagine that the decoder should selectively focus on particular parts of the input se- 
quence at particular decoding steps. Bahdanau’s attention mechanism provided a simple 
means by which the decoder could dynamically attend to different parts of the input at each 
decoding step. The high-level idea is that the encoder could produce a representation of 
length equal to the original input sequence. Then, at decoding time, the decoder can (via 
some control mechanism) receive as input a context vector consisting of a weighted sum 
of the representations on the input at each time step. Intuitively, the weights determine the 
extent to which each step’s context “focuses” on each input token, and the key is to make 
this process for assigning the weights differentiable so that it can be learned along with all 
of the other neural network parameters. 


Initially, the idea was a remarkably successful enhancement to the recurrent neural net- 
works that already dominated machine translation applications. The models performed 
better than the original encoder—decoder sequence-to-sequence architectures. Furthermore, 
researchers noted that some nice qualitative insights sometimes emerged from inspecting 
the pattern of attention weights. In translation tasks, attention models often assigned high 
attention weights to cross-lingual synonyms when generating the corresponding words in 
the target language. For example, when translating the sentence “my feet hurt” to “j’ai
mal au pieds”, the neural network might assign high attention weights to the representation of
“feet” when generating the corresponding French word “pieds”. These insights spurred
claims that attention models confer “interpretability”, although what precisely the attention
weights mean, i.e., how, if at all, they should be interpreted, remains a hazy research
topic.


However, attention mechanisms soon emerged as more significant concerns, beyond their 
usefulness as an enhancement for encoder—decoder recurrent neural networks and their pu- 
tative usefulness for picking out salient inputs. Vaswani et al. (2017) proposed the Trans- 
former architecture for machine translation, dispensing with recurrent connections alto- 
gether, and instead relying on cleverly arranged attention mechanisms to capture all rela- 
tionships among input and output tokens. The architecture performed remarkably well, and 
by 2018 the Transformer began showing up in the majority of state-of-the-art natural lan- 
guage processing systems. Moreover, at the same time, the dominant practice in natural lan- 
guage processing became to pretrain large-scale models on enormous generic background 
corpora to optimize some self-supervised pretraining objective, and then to fine-tune these 
models using the available downstream data. The gap between Transformers and traditional 
architectures grew especially wide when applied in this pretraining paradigm, and thus the 
ascendance of Transformers coincided with the ascendance of such large-scale pretrained
models, now sometimes called foundation models (Bommasani et al., 2021). 


In this chapter, we introduce attention models, starting with the most basic intuitions and 
the simplest instantiations of the idea. We then work our way up to the Transformer architecture,
the vision Transformer, and the landscape of modern Transformer-based pretrained
models. 


11.1 Queries, Keys, and Values 


So far all the networks we have reviewed crucially relied on the input being of a well- 
defined size. For instance, the images in ImageNet are of size 224 x 224 pixels and CNNs 
are specifically tuned to this size. Even in natural language processing the input size for 
RNNs is well defined and fixed. Variable size is addressed by sequentially processing one 
token at a time, or by specially designed convolution kernels (Kalchbrenner et al., 2014). 
This approach can lead to significant problems when the input is truly of varying size with 
varying information content, such as in Section 10.7 in the transformation of text (Sutskever 
et al., 2014). In particular, for long sequences it becomes quite difficult to keep track of 
everything that has already been generated or even viewed by the network. Even explicit 
tracking heuristics such as proposed by Yang et al. (2016) only offer limited benefit.


Compare this to databases. In their simplest form they are collections of keys ($k$) and values
($v$). For instance, our database $\mathcal{D}$ might consist of tuples {(“Zhang”, “Aston”), (“Lipton”,
“Zachary”), (“Li”, “Mu”), (“Smola”, “Alex”), (“Hu”, “Rachel”), (“Werness”, “Brent”)}
with the last name being the key and the first name being the value. We can operate on
$\mathcal{D}$, for instance with the exact query ($q$) for “Li” which would return the value “Mu”. If
(“Li”, “Mu”) was not a record in $\mathcal{D}$, there would be no valid answer. If we also allowed for
approximate matches, we would retrieve (“Lipton”, “Zachary”) instead. This quite simple
and trivial example nonetheless teaches us a number of useful things:


• We can design queries $q$ that operate on ($k$, $v$) pairs in such a manner as to be valid
regardless of the database size.


• The same query can receive different answers, according to the contents of the database.


• The “code” being executed for operating on a large state space (the database) can be quite
simple (e.g., exact match, approximate match, top-$k$).


• There is no need to compress or simplify the database to make the operations effective.


Clearly we would not have introduced a simple database here if it wasn’t for the purpose of 
explaining deep learning. Indeed, this leads to one of the most exciting concepts introduced 
in deep learning in the past decade: the attention mechanism (Bahdanau et al., 2014). We 
will cover the specifics of its application to machine translation later. For now, simply 
consider the following: denote by $\mathcal{D} = \{(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)\}$ a database of $m$ tuples of
keys and values. Moreover, denote by $\mathbf{q}$ a query. Then we can define the attention over $\mathcal{D}$
as


$$\textrm{Attention}(\mathbf{q}, \mathcal{D}) = \sum_{i=1}^m \alpha(\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i, \tag{11.1.1}$$


where $\alpha(\mathbf{q}, \mathbf{k}_i) \in \mathbb{R}$ ($i = 1, \ldots, m$) are scalar attention weights. The operation itself is
typically referred to as attention pooling. The name attention derives from the fact that the
operation pays particular attention to the terms for which the weight $\alpha$ is significant (i.e.,
large). As such, the attention over $\mathcal{D}$ generates a linear combination of values contained in
the database. In fact, this contains the above example as a special case where all but one
weight is zero. We have a number of special cases:


• The weights $\alpha(\mathbf{q}, \mathbf{k}_i)$ are nonnegative. In this case the output of the attention mechanism
is contained in the convex cone spanned by the values $\mathbf{v}_i$.


• The weights $\alpha(\mathbf{q}, \mathbf{k}_i)$ form a convex combination, i.e., $\sum_i \alpha(\mathbf{q}, \mathbf{k}_i) = 1$ and $\alpha(\mathbf{q}, \mathbf{k}_i) \geq 0$
for all $i$. This is the most common setting in deep learning.


• Exactly one of the weights $\alpha(\mathbf{q}, \mathbf{k}_i)$ is 1, while all others are 0. This is akin to a traditional
database query.


• All weights are equal, i.e., $\alpha(\mathbf{q}, \mathbf{k}_i) = \frac{1}{m}$ for all $i$. This amounts to averaging across the
entire database, also called average pooling in deep learning.


A common strategy for ensuring that the weights sum up to 1 is to normalize them via

$$\alpha(\mathbf{q}, \mathbf{k}_i) = \frac{\alpha(\mathbf{q}, \mathbf{k}_i)}{\sum_j \alpha(\mathbf{q}, \mathbf{k}_j)}. \tag{11.1.2}$$

In particular, to ensure that the weights are also nonnegative, one can resort to exponentiation.
This means that we can now pick any function $a(\mathbf{q}, \mathbf{k})$ and then apply the softmax
operation used for multinomial models to it via

$$\alpha(\mathbf{q}, \mathbf{k}_i) = \frac{\exp(a(\mathbf{q}, \mathbf{k}_i))}{\sum_j \exp(a(\mathbf{q}, \mathbf{k}_j))}. \tag{11.1.3}$$

This operation is readily available in all deep learning frameworks. It is differentiable and
its gradient never vanishes, both of which are desirable properties in a model. Note though,
the attention mechanism introduced above is not the only option. For instance, we can
design a non-differentiable attention model that can be trained using reinforcement learning
methods (Mnih et al., 2014). As one would expect, training such a model is quite complex.
Consequently the bulk of modern attention research follows the framework outlined in Fig.
11.1.1. We thus focus our exposition on this family of differentiable mechanisms.


What is quite remarkable is that the actual “code” for executing on the set of keys and values,
namely the query, can be quite concise, even though the space to operate on is significant.
This is a desirable property for a network layer as it does not require too many parameters to
learn. Just as convenient is the fact that attention can operate on arbitrarily large databases
without the need to change the way the attention pooling operation is performed.


Fig. 11.1.1: The attention mechanism computes a linear combination over values $\mathbf{v}_i$ via attention
pooling, where weights are derived according to the compatibility between a query $\mathbf{q}$ and
keys $\mathbf{k}_i$.
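To make the softmax normalization in (11.1.3) concrete, here is a small sketch under stated assumptions: the scoring function `score` (negative squared distance) and the helper `softmax_attention` are hypothetical choices for illustration only, since at this point any function $a(\mathbf{q}, \mathbf{k})$ could be used. It also illustrates that the same code works regardless of the database size.

```python
import torch

def score(q, keys):
    # Hypothetical scoring function a(q, k): negative squared distance
    return -((q - keys) ** 2).sum(dim=-1)

def softmax_attention(q, keys, values):
    """Softmax-normalize the scores as in (11.1.3), then pool as in (11.1.1)."""
    weights = torch.softmax(score(q, keys), dim=0)
    return weights @ values, weights

torch.manual_seed(0)
q = torch.randn(3)
for m in (5, 500):  # the same code handles any database size
    keys, values = torch.randn(m, 3), torch.randn(m, 2)
    output, weights = softmax_attention(q, keys, values)
    print(m, tuple(output.shape), float(weights.sum()))
```

The output always has the dimensionality of a single value, and the weights are nonnegative and sum to 1 up to floating point error.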


import torch
from d2l import torch as d2l


11.1.1 Visualization 


One of the benefits of the attention mechanism is that it can be quite intuitive, particularly
when the weights are nonnegative and sum to 1. In this case we might interpret large
weights as a way for the model to select components of relevance. While this is a good
intuition, it is important to remember that it is just that, an intuition. Regardless, we may
want to visualize its effect on the given set of keys when applying a variety of different
queries.


We thus define the show_heatmaps function. Note that it does not take a matrix (of attention
weights) as its input but rather a tensor with four axes, allowing for an array of different
queries and weights. Consequently the input matrices have the shape (number of rows
for display, number of columns for display, number of queries, number of keys). This
will come in handy later on when we want to visualize the inner workings of
Transformers.


#@save
def show_heatmaps(matrices, xlabel, ylabel, titles=None, figsize=(2.5, 2.5),
                  cmap='Reds'):
    """Show heatmaps of matrices."""
    d2l.use_svg_display()
    num_rows, num_cols, _, _ = matrices.shape
    fig, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize,
                                 sharex=True, sharey=True, squeeze=False)
    for i, (row_axes, row_matrices) in enumerate(zip(axes, matrices)):
        for j, (ax, matrix) in enumerate(zip(row_axes, row_matrices)):
            pcm = ax.imshow(matrix.detach().numpy(), cmap=cmap)
            if i == num_rows - 1:
                ax.set_xlabel(xlabel)
            if j == 0:
                ax.set_ylabel(ylabel)
            if titles:
                ax.set_title(titles[j])
    fig.colorbar(pcm, ax=axes, shrink=0.6);


As a quick sanity check let’s visualize the identity matrix, representing a case where the 
attention weight is 1 only when the query and the key are the same. 


attention_weights = torch.eye(10).reshape((1, 1, 10, 10)) 
show_heatmaps(attention_weights, xlabel='Keys', ylabel='Queries')


11.1.2 Summary


The attention mechanism allows us to aggregate data from many (key, value) pairs. So
far our discussion was quite abstract, simply describing a way to pool data. We have not
explained yet where those mysterious queries, keys, and values might arise from. Some
intuition might help here: for instance, in a regression setting, the query might correspond
to the location where the regression should be carried out. The keys are the locations
where past data was observed and the values are the (regression) values themselves. This
is the so-called Nadaraya-Watson estimator (Nadaraya, 1964, Watson, 1964) that we will
be studying in the next section.


By design, the attention mechanism provides a differentiable means of control by which a
neural network can select elements from a set and construct an associated weighted sum
over representations.


11.1.3 Exercises 


1. Suppose that you wanted to reimplement approximate (key, query) matches as used in 
classical databases, which attention function would you pick? 


2. Suppose that the attention function is given by $a(\mathbf{q}, \mathbf{k}_i) = \mathbf{q}^\top \mathbf{k}_i$ and that $\mathbf{k}_i = \mathbf{v}_i$ for
$i = 1, \ldots, m$. Denote by $p(\mathbf{k}_i; \mathbf{q})$ the probability distribution over keys when using the
softmax normalization in (11.1.3). Prove that $\nabla_{\mathbf{q}} \mathrm{Attention}(\mathbf{q}, \mathcal{D}) = \mathrm{Cov}_{p(\mathbf{k}_i; \mathbf{q})}[\mathbf{k}_i]$.


3. Design a differentiable search engine using the attention mechanism.


4. Review the design of the Squeeze and Excitation Networks (Hu et al., 2018) and interpret
them through the lens of the attention mechanism.


11.2 Attention Pooling by Similarity


Now that we have introduced the primary components of the attention mechanism, let’s
use them in a rather classical setting, namely regression and classification via kernel density
estimation (Nadaraya, 1964, Watson, 1964). This detour simply provides additional
background: it is entirely optional and can be skipped if needed. At their core, Nadaraya-Watson
estimators rely on some similarity kernel $\alpha(\mathbf{q}, \mathbf{k})$ relating queries $\mathbf{q}$ to keys $\mathbf{k}$. Some
common kernels are

$$\begin{aligned}
\alpha(\mathbf{q}, \mathbf{k}) &= \exp\left(-\frac{1}{2} \|\mathbf{q} - \mathbf{k}\|^2\right) && \text{Gaussian;} \\
\alpha(\mathbf{q}, \mathbf{k}) &= 1 \text{ if } \|\mathbf{q} - \mathbf{k}\| < 1 && \text{Boxcar;} \\
\alpha(\mathbf{q}, \mathbf{k}) &= \max\left(0, 1 - \|\mathbf{q} - \mathbf{k}\|\right) && \text{Epanechikov.}
\end{aligned} \tag{11.2.1}$$


There are many more choices that we could pick. See a Wikipedia article for a more
extensive review and how the choice of kernels is related to kernel density estimation,
sometimes also called Parzen Windows (Parzen, 1957). All of the kernels are heuristic and can
be tuned. For instance, we can adjust the width, not only on a global basis but even on a
per-coordinate basis. Regardless, all of them lead to the following equation for regression
and classification alike:

$$f(\mathbf{q}) = \sum_i \mathbf{v}_i \frac{\alpha(\mathbf{q}, \mathbf{k}_i)}{\sum_j \alpha(\mathbf{q}, \mathbf{k}_j)}. \tag{11.2.2}$$


In the case of a (scalar) regression with observations $(\mathbf{x}_i, y_i)$ for features and labels respectively,
$\mathbf{v}_i = y_i$ are scalars, $\mathbf{k}_i = \mathbf{x}_i$ are vectors, and the query $\mathbf{q}$ denotes the new location
where $f$ should be evaluated. In the case of (multiclass) classification, we use one-hot
encoding of $y_i$ to obtain $\mathbf{v}_i$. One of the convenient properties of this estimator is that it
requires no training. Even more so, if we suitably narrow the kernel with increasing amounts
of data, the approach is consistent (Mack and Silverman, 1982), i.e., it will converge to
some statistically optimal solution. Let’s start by inspecting some kernels.
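The classification case of (11.2.2) can be sketched in a few lines. This is only an illustration under assumptions: the function name `nw_classify`, the Gaussian kernel with a single global width `sigma`, and the toy data are all made up here and are not taken from the book's code.

```python
import torch
import torch.nn.functional as F

def nw_classify(x_train, y_train, x_query, num_classes, sigma=1.0):
    """Nadaraya-Watson estimate (11.2.2) with one-hot values and a Gaussian kernel."""
    # Squared distances between queries and keys, shape (num_queries, num_train)
    d2 = torch.cdist(x_query, x_train) ** 2
    # Softmax over keys equals the Gaussian kernel divided by its sum over keys
    weights = torch.softmax(-d2 / (2 * sigma**2), dim=1)
    values = F.one_hot(y_train, num_classes).float()  # v_i = one-hot(y_i)
    return weights @ values  # estimated class probabilities per query

# Made-up toy data: two well-separated clusters in the plane
x_train = torch.tensor([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.8]])
y_train = torch.tensor([0, 0, 1, 1])
x_query = torch.tensor([[0.1, 0.0], [2.9, 3.0]])
probs = nw_classify(x_train, y_train, x_query, num_classes=2)
print(probs.argmax(dim=1))
```

Because the values are one-hot vectors and the weights form a convex combination, each output row is itself a probability vector over classes.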


import numpy as np
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

d2l.use_svg_display()


11.2.1 Kernels and Data 


All the kernels $\alpha(\mathbf{k}, \mathbf{q})$ defined in this section are translation and rotation invariant; that
is, if we shift and rotate $\mathbf{k}$ and $\mathbf{q}$ in the same manner, the value of $\alpha$ remains unchanged.
For simplicity we thus pick scalar arguments $k, q \in \mathbb{R}$ and pick the key $k = 0$ as the origin.
This yields:


# Define some kernels
def gaussian(x):
    return torch.exp(-x**2 / 2)

def boxcar(x):
    return torch.abs(x) < 1.0

def constant(x):
    return 1.0 + 0 * x

def epanechikov(x):
    return torch.max(1 - torch.abs(x), torch.zeros_like(x))

fig, axes = d2l.plt.subplots(1, 4, sharey=True, figsize=(12, 3))

kernels = (gaussian, boxcar, constant, epanechikov)
names = ('Gaussian', 'Boxcar', 'Constant', 'Epanechikov')
x = torch.arange(-2.5, 2.5, 0.1)
for kernel, name, ax in zip(kernels, names, axes):
    ax.plot(x.detach().numpy(), kernel(x).detach().numpy())
    ax.set_xlabel(name)


d2l.plt.show()


Different kernels correspond to different notions of range and smoothness. For instance,
the boxcar kernel only attends to observations within a distance of 1 (or some otherwise
defined hyperparameter) and does so indiscriminately.


To see Nadaraya-Watson estimation in action, let’s define some training data. In the
following we use the dependency

$$y_i = 2 \sin(x_i) + x_i + \epsilon, \tag{11.2.3}$$


where $\epsilon$ is drawn from a normal distribution with zero mean and unit variance. We draw
40 training examples.


def f(x):
    return 2 * torch.sin(x) + x

n = 40
x_train, _ = torch.sort(torch.rand(n) * 5)
y_train = f(x_train) + torch.randn(n)
x_val = torch.arange(0, 5, 0.1)
y_val = f(x_val)


11.2.2 Attention Pooling via Nadaraya-Watson Regression


Now that we have data and kernels, all we need is a function that computes the kernel 
regression estimates. Note that we also want to obtain the relative kernel weights in order 
to perform some minor diagnostics. Hence we first compute the kernel between all training 
features (covariates) x_train and all validation features x_val. This yields a matrix, which 
we subsequently normalize. When multiplied with the training labels y_train we obtain 
the estimates. 


Recall attention pooling in (11.1.1). Let each validation feature be a query, and each
training feature-label pair be a key-value pair. As a result, the normalized relative kernel
weights (attention_w below) are the attention weights.


def nadaraya_watson(x_train, y_train, x_val, kernel):
    dists = x_train.reshape((-1, 1)) - x_val.reshape((1, -1))
    # Each column/row corresponds to each query/key
    k = kernel(dists).type(torch.float32)
    # Normalization over keys for each query
    attention_w = k / k.sum(0)
    y_hat = y_train @ attention_w
    return y_hat, attention_w


Let’s have a look at the kind of estimates that the different kernels produce. 


def plot(x_train, y_train, x_val, y_val, kernels, names, attention=False):
    fig, axes = d2l.plt.subplots(1, 4, sharey=True, figsize=(12, 3))
    for kernel, name, ax in zip(kernels, names, axes):
        y_hat, attention_w = nadaraya_watson(x_train, y_train, x_val, kernel)
        if attention:
            pcm = ax.imshow(attention_w.detach().numpy(), cmap='Reds')
        else:
            ax.plot(x_val, y_hat)
            ax.plot(x_val, y_val, 'm--')
            ax.plot(x_train, y_train, 'o', alpha=0.5);
        ax.set_xlabel(name)
        if not attention:
            ax.legend(['y_hat', 'y'])
    if attention:
        fig.colorbar(pcm, ax=axes, shrink=0.7)

plot(x_train, y_train, x_val, y_val, kernels, names)


The first thing that stands out is that all three nontrivial kernels (Gaussian, Boxcar, and
Epanechikov) produce fairly workable estimates that are not too far from the true function.
Only the constant kernel that leads to the trivial estimate $f(x) = \frac{1}{n} \sum_i y_i$ produces a rather
unrealistic result. Let’s inspect the attention weighting a bit more closely:


plot(x_train, y_train, x_val, y_val, kernels, names, attention=True)


The visualization clearly shows why the estimates for Gaussian, Boxcar, and Epanechikov
are very similar: after all, they are derived from very similar attention weights, despite the
different functional form of the kernel. This raises the question as to whether this is always
the case.


11.2.3 Adapting Attention Pooling 


We could replace the Gaussian kernel with one of a different width. That is, we could use
$\alpha(\mathbf{q}, \mathbf{k}) = \exp\left(-\frac{1}{2\sigma^2} \|\mathbf{q} - \mathbf{k}\|^2\right)$ where $\sigma^2$ determines the width of the kernel. Let’s see
whether this affects the outcomes.


sigmas = (0.1, 0.2, 0.5, 1)
names = ['Sigma ' + str(sigma) for sigma in sigmas]

def gaussian_with_width(sigma):
    return (lambda x: torch.exp(-x**2 / (2*sigma**2)))

kernels = [gaussian_with_width(sigma) for sigma in sigmas]
plot(x_train, y_train, x_val, y_val, kernels, names)


Clearly, the narrower the kernel, the less smooth the estimate. At the same time, it adapts
better to the local variations. Let’s look at the corresponding attention weights.


plot(x_train, y_train, x_val, y_val, kernels, names, attention=True)


As we would expect, the narrower the kernel, the narrower the range of large attention
weights. It is also clear that picking the same width everywhere might not be ideal. In fact, Silverman
(1986) proposed a heuristic that depends on the local density. Many more such “tricks”
have been proposed. For instance, Norelli et al. (2022) used a similar nearest-neighbor
interpolation technique for designing cross-modal image and text representations.
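One way to put a number on "the range of large attention weights" is the quantity $1 / \sum_i w_i^2$, which equals the number of keys for uniform weights and 1 for a one-hot weight. The sketch below is an illustration only: the helper `gaussian_weights`, the evenly spaced keys, and the use of this particular summary statistic are assumptions, not something the text above prescribes.

```python
import torch

def gaussian_weights(x_keys, x_query, sigma):
    """Normalized Gaussian kernel weights of one query over all keys."""
    return torch.softmax(-(x_keys - x_query) ** 2 / (2 * sigma**2), dim=0)

x_keys = torch.linspace(0, 5, 40)  # evenly spaced keys (an assumption)
x_query = torch.tensor(2.5)
effective = []
for sigma in (0.1, 0.2, 0.5, 1.0):
    w = gaussian_weights(x_keys, x_query, sigma)
    # 1 / sum_i w_i^2 is m for uniform weights and 1 for a one-hot weight
    effective.append(float(1.0 / (w**2).sum()))
print(effective)
```

In this setup the effective number of attended keys grows with the width, matching the qualitative picture in the heatmaps.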


The astute reader might wonder why we are providing this deep dive for a method that is over
half a century old. First, it is one of the earliest precursors of modern attention mechanisms.
Second, it is great for visualization. Third, and just as importantly, it demonstrates the limits
of hand-crafted attention mechanisms. A much better strategy is to learn the mechanism,
by learning the representations for queries and keys. This is what we will embark on in the
following sections.


11.2.4 Summary 


Nadaraya-Watson kernel regression is an early precursor of the current attention mechanisms.
It can be used directly with little to no training or tuning, either for classification or
regression. The attention weight is assigned according to the similarity (or distance) between
query and key, and according to how many similar observations are available.


11.2.5 Exercises


1. Parzen windows density estimates are given by $\hat{p}(\mathbf{x}) = \frac{1}{n} \sum_i k(\mathbf{x}, \mathbf{x}_i)$. Prove that for
binary classification the function $\hat{p}(\mathbf{x}, y = 1) - \hat{p}(\mathbf{x}, y = -1)$, as obtained by Parzen
windows, is equivalent to Nadaraya-Watson classification.


2. Implement stochastic gradient descent to learn a good value for kernel widths in
Nadaraya-Watson regression.

   1. What happens if you just use the above estimates to minimize $(f(\mathbf{x}_i) - y_i)^2$ directly?
   Hint: $y_i$ is part of the terms used to compute $f$.

   2. Remove $(\mathbf{x}_i, y_i)$ from the estimate for $f(\mathbf{x}_i)$ and optimize over the kernel widths.
   Do you still observe overfitting?


3. Assume that all $\mathbf{x}$ lie on the unit sphere, i.e., all satisfy $\|\mathbf{x}\| = 1$. Can you simplify the
$\|\mathbf{x} - \mathbf{x}_i\|^2$ term in the exponential? Hint: we will later see that this is very closely related
to dot product attention.


4. Recall that Mack and Silverman (1982) proved that Nadaraya-Watson estimation is
consistent. How quickly should you reduce the scale for the attention mechanism as you get
more data? Provide some intuition for your answer. Does it depend on the dimensionality
of the data? How?


11.3 Attention Scoring Functions


In Section 11.2, we used a number of different distance-based kernels, including a Gaussian 
kernel to model interactions between queries and keys. As it turns out, distance functions 
are slightly more expensive to compute than dot products. As such, with the softmax op- 
eration to ensure nonnegative attention weights, much of the work has gone into attention 
scoring functions a in (11.1.3) and Fig. 11.3.1 that are simpler to compute. 


Fig. 11.3.1: Computing the output of attention pooling as a weighted average of values, where weights
are computed with the attention scoring function $a$ and the softmax operation.


import math
import torch
from torch import nn
from d2l import torch as d2l


11.3.1 Dot Product Attention 


Let’s review the attention function (without exponentiation) from the Gaussian kernel for 
a moment: 


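$$a(\mathbf{q}, \mathbf{k}_i) = -\frac{1}{2} \|\mathbf{q} - \mathbf{k}_i\|^2 = \mathbf{q}^\top \mathbf{k}_i - \frac{1}{2} \|\mathbf{k}_i\|^2 - \frac{1}{2} \|\mathbf{q}\|^2. \tag{11.3.1}$$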


First, note that the final term depends on $\mathbf{q}$ only. As such it is identical for all $(\mathbf{q}, \mathbf{k}_i)$
pairs. Normalizing the attention weights to 1, as is done in (11.1.3), ensures that this term
disappears entirely. Second, note that both batch and layer normalization (to be discussed
later) lead to activations that have well-bounded, and often constant, norms $\|\mathbf{k}_i\|$. This is
the case, for instance, whenever the keys $\mathbf{k}_i$ were generated by a layer norm. As such, we
can drop the term $\frac{1}{2} \|\mathbf{k}_i\|^2$ from the definition of $a$ without any major change in the outcome.
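Both simplifications can be checked numerically. The following sketch uses randomly drawn toy tensors (an assumption for illustration) and compares the softmax weights obtained from the full distance-based score in (11.3.1) with those obtained after dropping the query-only term, and, for keys rescaled to unit norm, after dropping the key-norm term as well.

```python
import torch

torch.manual_seed(0)
q = torch.randn(8)
keys = torch.randn(5, 8)

# Scores from the full (negative, halved) squared distance in (11.3.1)
full = -0.5 * ((q - keys) ** 2).sum(dim=1)
# The same scores without the -||q||^2 / 2 term, which is shared by all keys
reduced = keys @ q - 0.5 * (keys**2).sum(dim=1)
w_full = torch.softmax(full, dim=0)
w_reduced = torch.softmax(reduced, dim=0)

# With keys of constant norm, the -||k_i||^2 / 2 term is shared as well,
# so plain dot products yield the same attention weights
unit_keys = keys / keys.norm(dim=1, keepdim=True)
w_distance = torch.softmax(-0.5 * ((q - unit_keys) ** 2).sum(dim=1), dim=0)
w_dot = torch.softmax(unit_keys @ q, dim=0)

print(torch.allclose(w_full, w_reduced, atol=1e-5),
      torch.allclose(w_distance, w_dot, atol=1e-5))
```

The softmax is invariant to adding the same constant to every score, which is why terms that do not depend on the key index have no effect on the weights.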


Last, we need to keep the order of magnitude of the arguments in the exponential function
under control. Assume that all the elements of the query $\mathbf{q} \in \mathbb{R}^d$ and the key $\mathbf{k}_i \in \mathbb{R}^d$
are independent and identically drawn random variables with zero mean and unit variance.
The dot product between both vectors has zero mean and a variance of $d$. To ensure that
the variance of the dot product still remains 1 regardless of vector length, we use the scaled
dot product attention scoring function. That is, we rescale the dot product by $1/\sqrt{d}$. We
thus arrive at the first commonly used attention function that is used, e.g., in Transformers
(Vaswani et al., 2017):

$$a(\mathbf{q}, \mathbf{k}_i) = \mathbf{q}^\top \mathbf{k}_i / \sqrt{d}. \tag{11.3.2}$$

Note that attention weights $\alpha$ still need normalizing. We can simplify this further via
(11.1.3) by using the softmax operation:

$$\alpha(\mathbf{q}, \mathbf{k}_i) = \mathrm{softmax}(a(\mathbf{q}, \mathbf{k}_i)) = \frac{\exp(\mathbf{q}^\top \mathbf{k}_i / \sqrt{d})}{\sum_{j=1}^m \exp(\mathbf{q}^\top \mathbf{k}_j / \sqrt{d})}. \tag{11.3.3}$$

As it turns out, all popular attention mechanisms use the softmax, hence we will limit
ourselves to that in the remainder of this chapter.
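The variance argument behind the $1/\sqrt{d}$ scaling can be checked empirically. This is a rough Monte Carlo sketch with standard normal entries; the sample size and the dimensions tried are arbitrary choices made for illustration.

```python
import math
import torch

torch.manual_seed(0)
num_samples = 20_000
variances = {}
for d in (4, 64, 256):
    q = torch.randn(num_samples, d)
    k = torch.randn(num_samples, d)
    dots = (q * k).sum(dim=1)     # one dot product per sampled (q, k) pair
    scaled = dots / math.sqrt(d)  # the scaled score of (11.3.2)
    variances[d] = (float(dots.var()), float(scaled.var()))
print(variances)
```

Up to sampling noise, the variance of the raw dot product grows linearly with $d$, while that of the scaled score stays close to 1.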


11.3.2 Convenience Functions 


We need a few functions to make the attention mechanism efficient to deploy. This includes 
tools for dealing with strings of variable lengths (common for natural language processing) 
and tools for efficient evaluation on minibatches (batch matrix multiplication). 


Masked Softmax Operation 


One of the most popular applications of the attention mechanism is to sequence models.
Hence we need to be able to deal with sequences of different lengths. In some cases, such
sequences may end up in the same minibatch, necessitating padding with dummy tokens
for shorter sequences (see Section 10.5 for an example). These special tokens do not carry
meaning. For instance, assume that we have the following three sentences:


Dive into Deep Learning 
Learn to code <blank> 
Hello world <blank> <blank> 


Since we do not want blanks in our attention model we simply need to limit $\sum_{i=1}^n \alpha(\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i$
to $\sum_{i=1}^l \alpha(\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i$ for however long, $l \leq n$, the actual sentence is. Since it is such a common
problem, it has a name: the masked softmax operation.


Let’s implement it. Actually, the implementation cheats ever so slightly by effectively setting the
contribution of $\mathbf{v}_i$, for $i > l$, to zero. To do so, it sets the corresponding attention scores (the
inputs to the softmax) to a large negative number, such as $-10^6$, in order to make their contribution
to gradients and values vanish in practice.
This is done since linear algebra kernels and operators are heavily optimized for GPUs and
it is faster to be slightly wasteful in computation rather than to have code with conditional
(if then else) statements.


def masked_softmax(X, valid_lens):  #@save
    """Perform softmax operation by masking elements on the last axis."""
    # X: 3D tensor, valid_lens: 1D or 2D tensor
    def _sequence_mask(X, valid_len, value=0):
        maxlen = X.size(1)
        mask = torch.arange((maxlen), dtype=torch.float32,
                            device=X.device)[None, :] < valid_len[:, None]
        X[~mask] = value
        return X

    if valid_lens is None:
        return nn.functional.softmax(X, dim=-1)
    else:
        shape = X.shape
        if valid_lens.dim() == 1:
            valid_lens = torch.repeat_interleave(valid_lens, shape[1])
        else:
            valid_lens = valid_lens.reshape(-1)
        # On the last axis, replace masked elements with a very large negative
        # value, whose exponentiation outputs 0
        X = _sequence_mask(X.reshape(-1, shape[-1]), valid_lens, value=-1e6)
        return nn.functional.softmax(X.reshape(shape), dim=-1)


To illustrate how this function works, consider a minibatch of two examples of size $2 \times 4$,
where their valid lengths are 2 and 3, respectively. As a result of the masked softmax operation,
values beyond the valid lengths for each pair of vectors are all masked as zero.


masked_softmax(torch.rand(2, 2, 4), torch.tensor([2, 3])) 


tensor([[[0.4448, 0.5552, 0.0000, 0.0000],
         [0.4032, 0.5968, 0.0000, 0.0000]],

        [[0.2795, 0.2805, 0.4400, 0.0000],
         [0.2798, 0.3092, 0.4110, 0.0000]]])


If we need more fine-grained control to specify the valid length for each of the two vec- 
tors of every example, we simply use a two-dimensional tensor of valid lengths. This 
yields: 


masked_softmax(torch.rand(2, 2, 4), torch.tensor([[1, 3], [2, 4]])) 


tensor([[[1.0000, 0.0000, 0.0000, 0.0000],
         [0.4109, 0.2794, 0.3097, 0.0000]],

        [[0.3960, 0.6040, 0.0000, 0.0000],
         [0.2557, 0.1833, 0.2420, 0.3190]]])
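The masking trick can be compared against the "honest" computation that only looks at the valid prefix. The sketch below is self-contained and does not call the `masked_softmax` function defined earlier; the single-query setup, the variable names, and the use of `masked_fill` are assumptions made for brevity.

```python
import torch

torch.manual_seed(0)
scores = torch.randn(6)  # attention scores of one query over 6 keys
valid_len = 4            # only the first 4 keys are real tokens

# Mask by overwriting the scores of padded positions with -1e6
positions = torch.arange(scores.shape[0])
masked_scores = scores.masked_fill(positions >= valid_len, -1e6)
masked = torch.softmax(masked_scores, dim=0)

# Reference: softmax over the valid prefix only, padded with zeros
reference = torch.zeros_like(scores)
reference[:valid_len] = torch.softmax(scores[:valid_len], dim=0)

print(torch.allclose(masked, reference))
```

Since the exponential of such a large negative number underflows to zero in single precision, the masked positions receive zero weight and the remaining weights match the softmax over the valid prefix.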


Batch Matrix Multiplication 


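Another commonly used operation is to multiply batches of matrices by one another. This
comes in handy when we have minibatches of queries, keys, and values. More specifically,
assume that

$$\begin{aligned}
\mathbf{Q} &= [\mathbf{Q}_1, \mathbf{Q}_2, \ldots, \mathbf{Q}_n] \in \mathbb{R}^{n \times a \times b}, \\
\mathbf{K} &= [\mathbf{K}_1, \mathbf{K}_2, \ldots, \mathbf{K}_n] \in \mathbb{R}^{n \times b \times c}.
\end{aligned} \tag{11.3.4}$$

Then the batch matrix multiplication (BMM) computes the matrix product for each pair of
matrices in the batch:

$$\mathrm{BMM}(\mathbf{Q}, \mathbf{K}) = [\mathbf{Q}_1 \mathbf{K}_1, \mathbf{Q}_2 \mathbf{K}_2, \ldots, \mathbf{Q}_n \mathbf{K}_n] \in \mathbb{R}^{n \times a \times c}. \tag{11.3.5}$$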


Let’s see this in action in a deep learning framework. 


Q = torch.ones((2, 3, 4)) 
K = torch.ones((2, 4, 6)) 
d2l.check_shape(torch.bmm(Q, K), (2, 3, 6))
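To see that `torch.bmm` does what (11.3.5) describes, one can compare it with an explicit loop over the batch and with an equivalent `torch.einsum` contraction. The random inputs below are placeholders chosen only for this check.

```python
import torch

torch.manual_seed(0)
Q = torch.randn(2, 3, 4)  # n = 2 matrices of shape a x b = 3 x 4
K = torch.randn(2, 4, 6)  # n = 2 matrices of shape b x c = 4 x 6

batched = torch.bmm(Q, K)
# Multiply the matrices pair by pair and stack the results
looped = torch.stack([Q[i] @ K[i] for i in range(Q.shape[0])])
# The same contraction written with einsum
summed = torch.einsum('nab,nbc->nac', Q, K)

print(tuple(batched.shape), torch.allclose(batched, looped, atol=1e-5))
```

All three variants produce a tensor of shape (2, 3, 6); the batched call simply avoids the Python-level loop.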


11.3.3 Scaled Dot Product Attention 


Let’s return to the dot product attention introduced in (11.3.2). In general, it requires that
both the query and the key have the same vector length, say $d$, even though this can be
addressed easily by replacing $\mathbf{q}^\top \mathbf{k}$ with $\mathbf{q}^\top \mathbf{M} \mathbf{k}$ where $\mathbf{M}$ is a matrix suitably chosen for
translating between both spaces. For now assume that the dimensions match.


In practice, we often think of minibatches for efficiency, such as computing attention for
$n$ queries and $m$ key-value pairs, where queries and keys are of length $d$ and values are of
length $v$. The scaled dot product attention of queries $\mathbf{Q} \in \mathbb{R}^{n \times d}$, keys $\mathbf{K} \in \mathbb{R}^{m \times d}$, and
values $\mathbf{V} \in \mathbb{R}^{m \times v}$ thus can be written as

$$\mathrm{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d}}\right) \mathbf{V} \in \mathbb{R}^{n \times v}. \tag{11.3.6}$$
Note that when applying this to a minibatch, we need the batch matrix multiplication intro- 
duced in (11.3.5). In the following implementation of the scaled dot product attention, we 
use dropout for model regularization. 
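Before looking at the batched module, it may help to see (11.3.6) for a single example without masking or dropout. The function `scaled_dot_product` and the toy sizes below are assumptions introduced for this sketch only.

```python
import math
import torch

def scaled_dot_product(Q, K, V):
    """Unbatched version of (11.3.6): softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    weights = torch.softmax(Q @ K.T / math.sqrt(d), dim=-1)  # shape (n, m)
    return weights @ V, weights                              # shape (n, v)

torch.manual_seed(0)
n, m, d, v = 3, 5, 8, 2  # arbitrary toy sizes
Q, K, V = torch.randn(n, d), torch.randn(m, d), torch.randn(m, v)
output, weights = scaled_dot_product(Q, K, V)
print(tuple(output.shape), weights.sum(dim=-1))
```

Each row of the weight matrix is a convex combination over the $m$ keys, so each output row is a weighted average of the rows of $\mathbf{V}$.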


class DotProductAttention(nn.Module):  #@save
    """Scaled dot product attention."""
    def __init__(self, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    # Shape of queries: (batch_size, no. of queries, d)
    # Shape of keys: (batch_size, no. of key-value pairs, d)
    # Shape of values: (batch_size, no. of key-value pairs, value dimension)
    # Shape of valid_lens: (batch_size,) or (batch_size, no. of queries)
    def forward(self, queries, keys, values, valid_lens=None):
        d = queries.shape[-1]
        # Swap the last two dimensions of keys with keys.transpose(1, 2)
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        self.attention_weights = masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)


To illustrate how the DotProductAttention class works, we use a small toy example.
For the purpose of
our example we assume that we have a minibatch size of 2, a total of 10 keys and values,
and that the dimensionality of the values is 4. Lastly, we assume that the valid length per
observation is 2 and 6 respectively. Given that, we expect the output to be a $2 \times 1 \times 4$ tensor,
i.e., one row per example of the minibatch.


queries = torch.normal(0, 1, (2, 1, 2))
keys = torch.normal(0, 1, (2, 10, 2))
values = torch.normal(0, 1, (2, 10, 4))
valid_lens = torch.tensor([2, 6])

attention = DotProductAttention(dropout=0.5)
attention.eval()
d2l.check_shape(attention(queries, keys, values, valid_lens), (2, 1, 4))


Let’s check whether the attention weights actually vanish for anything beyond the second 
and sixth column respectively (because of setting the valid length to 2 and 6). 


d2l.show_heatmaps(attention.attention_weights.reshape((1, 1, 2, 10)),
                  xlabel='Keys', ylabel='Queries')


11.3.4 Additive Attention 


When queries $\mathbf{q}$ and keys $\mathbf{k}$ are vectors of different dimension, we can either use a matrix to
address the mismatch via $\mathbf{q}^\top \mathbf{M} \mathbf{k}$, or we can use additive attention as the scoring function.
Another benefit is that, as its name indicates, the attention is additive. This can lead to
some minor computational savings. Given a query $\mathbf{q} \in \mathbb{R}^q$ and a key $\mathbf{k} \in \mathbb{R}^k$, the additive
attention scoring function (Bahdanau et al., 2014) is given by

$$a(\mathbf{q}, \mathbf{k}) = \mathbf{w}_v^\top \tanh(\mathbf{W}_q \mathbf{q} + \mathbf{W}_k \mathbf{k}) \in \mathbb{R}, \tag{11.3.7}$$

where $\mathbf{W}_q \in \mathbb{R}^{h \times q}$, $\mathbf{W}_k \in \mathbb{R}^{h \times k}$, and $\mathbf{w}_v \in \mathbb{R}^h$ are the learnable parameters. This term
is then fed into a softmax to ensure both nonnegativity and normalization. An equivalent
interpretation of (11.3.7) is that the query and key are concatenated and fed into an MLP
with a single hidden layer. Using tanh as the activation function and disabling bias terms,
we implement additive attention as follows:


class AdditiveAttention(nn.Module):  #@save
    """Additive attention."""
    def __init__(self, num_hiddens, dropout, **kwargs):
        super(AdditiveAttention, self).__init__(**kwargs)
        self.W_k = nn.LazyLinear(num_hiddens, bias=False)
        self.W_q = nn.LazyLinear(num_hiddens, bias=False)
        self.w_v = nn.LazyLinear(1, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens):
        queries, keys = self.W_q(queries), self.W_k(keys)
        # After dimension expansion, shape of queries: (batch_size, no. of
        # queries, 1, num_hiddens) and shape of keys: (batch_size, 1, no. of
        # key-value pairs, num_hiddens). Sum them up with broadcasting
        features = queries.unsqueeze(2) + keys.unsqueeze(1)
        features = torch.tanh(features)
        # There is only one output of self.w_v, so we remove the last
        # one-dimensional entry from the shape. Shape of scores: (batch_size,
        # no. of queries, no. of key-value pairs)
        scores = self.w_v(features).squeeze(-1)
        self.attention_weights = masked_softmax(scores, valid_lens)
        # Shape of values: (batch_size, no. of key-value pairs, value
        # dimension)
        return torch.bmm(self.dropout(self.attention_weights), values)
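The claim that (11.3.7) is equivalent to feeding the concatenated query and key through a bias-free MLP with one hidden layer can be verified numerically. The sketch below uses randomly initialized stand-ins for the learnable parameters and toy dimensions; none of these names refer to attributes of the class above.

```python
import torch

torch.manual_seed(0)
q_dim, k_dim, h = 20, 2, 8  # toy sizes
W_q, W_k = torch.randn(h, q_dim), torch.randn(h, k_dim)
w_v = torch.randn(h)
q, k = torch.randn(q_dim), torch.randn(k_dim)

# Additive attention score as written in (11.3.7)
score_additive = w_v @ torch.tanh(W_q @ q + W_k @ k)

# The same score from a bias-free MLP with one hidden layer applied to [q; k]
W = torch.cat([W_q, W_k], dim=1)  # shape (h, q_dim + k_dim)
score_mlp = w_v @ torch.tanh(W @ torch.cat([q, k]))

print(float(score_additive), float(score_mlp))
```

Stacking the two weight matrices side by side turns the sum of two projections into a single projection of the concatenated vector, which is why both scores agree up to rounding.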


Let’s see how AdditiveAttention works. In our toy example we pick queries, keys and
values of size (2, 1, 20), (2, 10, 2) and (2, 10, 4), respectively. This is identical to our choice
for DotProductAttention, except that now the queries are 20-dimensional. Likewise, we
pick (2, 6) as the valid lengths for the sequences in the minibatch.


queries = torch.normal(0, 1, (2, 1, 20))

attention = AdditiveAttention(num_hiddens=8, dropout=0.1)
attention.eval()
d2l.check_shape(attention(queries, keys, values, valid_lens), (2, 1, 4))


When reviewing the attention function we see a behavior that is qualitatively quite similar 
to that of DotProductAttention. That is, only terms within the chosen valid length (2, 6) 
are nonzero. 


d2l.show_heatmaps(attention.attention_weights.reshape((1, 1, 2, 10)),
                  xlabel='Keys', ylabel='Queries')


11.3.5 Summary


In this section we introduced the two key attention scoring functions: dot product and additive
attention. They are effective tools for aggregating across sequences of variable length.
In particular, the dot product attention is the mainstay of modern Transformer architectures.
When queries and keys are vectors of different lengths, we can use the additive attention
scoring function instead. Optimizing these layers is one of the key areas of advance in recent
years. For instance, NVIDIA’s Transformer Library and Megatron (Shoeybi et al.,
2019) crucially rely on efficient variants of the attention mechanism. We will dive into this
in quite a bit more detail as we review Transformers in later sections.


11.3.6 Exercises 


1. Implement distance-based attention by modifying the DotProductAttention code. Note
that you only need the squared norms of the keys $\|\mathbf{k}_i\|^2$ for an efficient implementation.


2. Modify the dot product attention to allow for queries and keys of different dimension- 
alities by employing a matrix to adjust dimensions. 


3. How does the computational cost scale with the dimensionality of the keys, queries, 
values, and their number? What about the memory bandwidth requirements? 


11.4 The Bahdanau Attention Mechanism


When we encountered machine translation in Section 10.7, we designed an encoder-decoder
architecture for sequence-to-sequence learning based on two RNNs (Sutskever et al., 2014).
Specifically, the RNN encoder transforms a variable-length sequence into a fixed-shape
context variable. Then, the RNN decoder generates the output (target) sequence token by
token based on the generated tokens and the context variable.


Recall Fig. 10.7.2 which we repeat (Fig. 11.4.1) with some additional detail. Convention- 
ally, in an RNN all relevant information about a source sequence is translated into some 
internal fixed-dimensional state representation by the encoder. It is this very state that is 
used by the decoder as the complete and exclusive source of information for generating the 
translated sequence. In other words, the sequence-to-sequence mechanism treats the inter- 
mediate state as a sufficient statistic of whatever string might have served as input. 


Fig. 11.4.1: Sequence-to-sequence model. The state, as generated by the encoder, is the only piece of
information shared between the encoder and the decoder.


While this is quite reasonable for short sequences, it is clear that it is infeasible for long ones, 
such as a book chapter or even just a very long sentence. After all, before too long there will 
simply not be enough “space” in the intermediate representation to store all that is important 
in the source sequence. Consequently the decoder will fail to translate long and complex 
sentences. One of the first to encounter this was Graves (2013) who tried to design an 
RNN to generate handwritten text. Since the source text has arbitrary length they designed a 
differentiable attention model to align text characters with the much longer pen trace, where 
the alignment moves only in one direction. This, in turn, draws on decoding algorithms in 
speech recognition, e.g., hidden Markov models (Rabiner and Juang, 1993). 


Inspired by the idea of learning to align, Bahdanau et al. (2014) proposed a differentiable 
attention model without the unidirectional alignment limitation. When predicting a token, 
if not all the input tokens are relevant, the model aligns (or attends) only to parts of the input 
sequence that are deemed relevant to the current prediction. This is then used to update the 
current state before generating the next token. While quite innocuous in its description, this 
Bahdanau attention mechanism has arguably turned into one of the most influential ideas 
of the past decade in deep learning, giving rise to Transformers (Vaswani et al., 2017) and 
many related new architectures.


import torch
from torch import nn
from d2l import torch as d2l


11.4.1 Model 


We follow the notation introduced by the sequence-to-sequence architecture of Section
10.7, in particular (10.7.3). The key idea is that instead of keeping the state, i.e., the
context variable $\mathbf{c}$ summarizing the source sentence, as fixed, we dynamically update it, as a
function of both the original text (encoder hidden states $\mathbf{h}_t$) and the text that was already
generated (decoder hidden states $\mathbf{s}_{t'-1}$). This yields $\mathbf{c}_{t'}$, which is updated after any
decoding time step $t'$. Suppose that the input sequence is of length $T$. In this case the context
variable is the output of attention pooling:


$$\mathbf{c}_{t'} = \sum_{t=1}^{T} \alpha(\mathbf{s}_{t'-1}, \mathbf{h}_t) \mathbf{h}_t. \tag{11.4.1}$$


We used $\mathbf{s}_{t'-1}$ as the query, and $\mathbf{h}_t$ as both the key and the value. Note that $\mathbf{c}_{t'}$ is then
used to generate the state $\mathbf{s}_{t'}$ and to generate a new token: see (10.7.3). In particular, the
attention weight $\alpha$ is computed as in (11.3.3) using the additive attention scoring function
defined by (11.3.7). This RNN encoder–decoder architecture using attention is depicted in
Fig. 11.4.2. Note that later this model was modified so as to include the already generated
tokens in the decoder as further context (i.e., the attention sum does not stop at $T$ but rather
it proceeds up to $t' - 1$). For instance, see Chan et al. (2015) for a description of this
strategy, as applied to speech recognition.
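To make (11.4.1) concrete, the following dependency-free sketch computes a context variable for a toy source sequence of three encoder states. The helper names (`softmax`, `dot`, `additive_score`, `context_variable`) and the hand-picked weights are illustrative assumptions, not part of the d2l library; a real model learns the weights and works on batched tensors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def additive_score(query, key, W_q, W_k, w_v):
    """Additive scoring: w_v^T tanh(W_q q + W_k k)."""
    hidden = [math.tanh(dot(rq, query) + dot(rk, key))
              for rq, rk in zip(W_q, W_k)]
    return dot(w_v, hidden)

def context_variable(query, enc_states, W_q, W_k, w_v):
    """c_{t'} = sum_t alpha(s_{t'-1}, h_t) h_t, as in (11.4.1)."""
    scores = [additive_score(query, h, W_q, W_k, w_v) for h in enc_states]
    alphas = softmax(scores)
    dim = len(enc_states[0])
    context = [sum(a * h[j] for a, h in zip(alphas, enc_states))
               for j in range(dim)]
    return alphas, context

# Hypothetical toy values: three encoder states h_1..h_3 and one query s_{t'-1}
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, -0.5]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[1.0, 0.0], [0.0, 1.0]]
w_v = [1.0, 1.0]
alphas, context = context_variable(query, enc_states, W_q, W_k, w_v)
print(alphas, context)
```

Because the attention weights are nonnegative and sum to one, the context variable is always a convex combination of the encoder hidden states.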


Layers in an RNN encoder–decoder model with the Bahdanau attention mechanism.


11.4.2 Defining the Decoder with Attention 


To implement the RNN encoder—decoder with attention, we only need to redefine the de- 
coder (omitting the generated symbols from the attention function simplifies the design). 
Let’s begin with the base interface for decoders with attention by defining the quite
unsurprisingly named AttentionDecoder class.


class AttentionDecoder(d2l.Decoder):  #@save
    """The base attention-based decoder interface."""
    def __init__(self):
        super().__init__()

    @property
    def attention_weights(self):
        raise NotImplementedError


We need to implement the RNN decoder in the Seq2SeqAttentionDecoder class. The 
state of the decoder is initialized with (i) the hidden states of the last layer of the encoder 
at all time steps, used as keys and values for attention; (ii) the hidden state of the encoder 
at all layers at the final time step, which serves to initialize the hidden state of the decoder; 
and (iii) the valid length of the encoder, to exclude the padding tokens in attention pooling. 
At each decoding time step, the hidden state of the final layer of the decoder, obtained at 
the previous time step, is used as the query of the attention mechanism. Both the output of 
the attention mechanism and the input embedding are concatenated to serve as the input of 
the RNN decoder. 


class Seq2SeqAttentionDecoder(AttentionDecoder):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0):
        super().__init__()
        self.attention = d2l.AdditiveAttention(num_hiddens, dropout)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(
            embed_size + num_hiddens, num_hiddens, num_layers,
            dropout=dropout)
        self.dense = nn.LazyLinear(vocab_size)
        self.apply(d2l.init_seq2seq)

    def init_state(self, enc_outputs, enc_valid_lens):
        # Shape of outputs: (num_steps, batch_size, num_hiddens).
        # Shape of hidden_state: (num_layers, batch_size, num_hiddens)
        outputs, hidden_state = enc_outputs
        return (outputs.permute(1, 0, 2), hidden_state, enc_valid_lens)

    def forward(self, X, state):
        # Shape of enc_outputs: (batch_size, num_steps, num_hiddens).
        # Shape of hidden_state: (num_layers, batch_size, num_hiddens)
        enc_outputs, hidden_state, enc_valid_lens = state
        # Shape of the output X: (num_steps, batch_size, embed_size)
        X = self.embedding(X).permute(1, 0, 2)
        outputs, self._attention_weights = [], []
        for x in X:
            # Shape of query: (batch_size, 1, num_hiddens)
            query = torch.unsqueeze(hidden_state[-1], dim=1)
            # Shape of context: (batch_size, 1, num_hiddens)
            context = self.attention(
                query, enc_outputs, enc_outputs, enc_valid_lens)
            # Concatenate on the feature dimension
            x = torch.cat((context, torch.unsqueeze(x, dim=1)), dim=-1)
            # Reshape x as (1, batch_size, embed_size + num_hiddens)
            out, hidden_state = self.rnn(x.permute(1, 0, 2), hidden_state)
            outputs.append(out)
            self._attention_weights.append(self.attention.attention_weights)
        # After fully connected layer transformation, shape of outputs:
        # (num_steps, batch_size, vocab_size)
        outputs = self.dense(torch.cat(outputs, dim=0))
        return outputs.permute(1, 0, 2), [enc_outputs, hidden_state,
                                          enc_valid_lens]

    @property
    def attention_weights(self):
        return self._attention_weights
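The third element of the decoder state, the encoder valid length, matters because padded source positions should receive zero attention weight. The plain-Python sketch below illustrates the idea with a hypothetical `masked_softmax` helper for a single query; it mimics the behavior only and is not the d2l implementation, which operates on batched tensors.

```python
import math

def masked_softmax(scores, valid_len):
    """Softmax over the first `valid_len` scores; padded positions get weight 0."""
    if not 0 < valid_len <= len(scores):
        raise ValueError("valid_len must be between 1 and len(scores)")
    valid = scores[:valid_len]
    m = max(valid)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in valid]
    total = sum(exps)
    return [e / total for e in exps] + [0.0] * (len(scores) - valid_len)

# Five source positions, of which only the first three are real tokens
scores = [2.0, 1.0, 0.5, 3.0, 3.0]
weights = masked_softmax(scores, valid_len=3)
print(weights)
```

Even though the last two scores are the largest, they belong to padding and therefore contribute nothing to the context variable.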


In the following, we test the implemented decoder with attention using a minibatch of four
sequences, each of which is seven time steps long.


vocab_size, embed_size, num_hiddens, num_layers = 10, 8, 16, 2
batch_size, num_steps = 4, 7
encoder = d2l.Seq2SeqEncoder(vocab_size, embed_size, num_hiddens, num_layers)
decoder = Seq2SeqAttentionDecoder(vocab_size, embed_size, num_hiddens,
                                  num_layers)
X = torch.zeros((batch_size, num_steps), dtype=torch.long)
state = decoder.init_state(encoder(X), None)
output, state = decoder(X, state)
d2l.check_shape(output, (batch_size, num_steps, vocab_size))
d2l.check_shape(state[0], (batch_size, num_steps, num_hiddens))
d2l.check_shape(state[1][0], (batch_size, num_hiddens))


11.4.3 Training 


Now that we have specified the new decoder, we can proceed analogously to Section 10.7.6:
specify the hyperparameters, instantiate a regular encoder and a decoder with attention,
and train this model for machine translation.


data = d2l.MTFraEng(batch_size=128)
embed_size, num_hiddens, num_layers, dropout = 256, 256, 2, 0.2
encoder = d2l.Seq2SeqEncoder(
    len(data.src_vocab), embed_size, num_hiddens, num_layers, dropout)
decoder = Seq2SeqAttentionDecoder(
    len(data.tgt_vocab), embed_size, num_hiddens, num_layers, dropout)
model = d2l.Seq2Seq(encoder, decoder, tgt_pad=data.tgt_vocab['<pad>'],
                    lr=0.005)
trainer = d2l.Trainer(max_epochs=30, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)


After the model is trained, we use it to translate a few English sentences into French and 
compute their BLEU scores.


engs = ['go .', 'i lost .', 'he\'s calm .', 'i\'m home .']
fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .']
preds, _ = model.predict_step(
    data.build(engs, fras), d2l.try_gpu(), data.num_steps)
for en, fr, p in zip(engs, fras, preds):
    translation = []
    for token in data.tgt_vocab.to_tokens(p):
        if token == '<eos>':
            break
        translation.append(token)
    print(f'{en} => {translation}, bleu,'
          f'{d2l.bleu(" ".join(translation), fr, k=2):.3f}')


go . => ['va', '!'], bleu,1.000
i lost . => ["j'ai", 'perdu', '.'], bleu,1.000
he's calm . => ['il', 'court', '.'], bleu,0.000
i'm home . => ['je', 'suis', 'chez', 'moi', '.'], bleu,1.000


Let’s visualize the attention weights when translating the last English sentence. We see that 
each query assigns non-uniform weights over key—value pairs. It shows that at each decod- 
ing step, different parts of the input sequences are selectively aggregated in the attention 
pooling. 


_, dec_attention_weights = model.predict_step(
    data.build([engs[-1]], [fras[-1]]), d2l.try_gpu(), data.num_steps, True)
attention_weights = torch.cat(
    [step[0][0][0] for step in dec_attention_weights], 0)
attention_weights = attention_weights.reshape((1, 1, -1, data.num_steps))

# Plus one to include the end-of-sequence token
d2l.show_heatmaps(
    attention_weights[:, :, :, :len(engs[-1].split()) + 1].cpu(),
    xlabel='Key positions', ylabel='Query positions')


11.4.4 Summary 


When predicting a token, if not all the input tokens are relevant, the RNN encoder–decoder
with the Bahdanau attention mechanism selectively aggregates different parts of the input
sequence. This is achieved by treating the state (context variable) as an output of additive
attention pooling. In the RNN encoder–decoder, the Bahdanau attention mechanism treats
the decoder hidden state at the previous time step as the query, and the encoder hidden
states at all the time steps as both the keys and values.


11.4.5 Exercises 
1. Replace GRU with LSTM in the experiment. 


2. Modify the experiment to replace the additive attention scoring function with the scaled 
dot-product. How does it influence the training efficiency? 


Discussions


11.5 Multi-Head Attention 


In practice, given the same set of queries, keys, and values we may want our model to 
combine knowledge from different behaviors of the same attention mechanism, such as 
capturing dependencies of various ranges (e.g., shorter-range vs. longer-range) within a se- 
quence. Thus, it may be beneficial to allow our attention mechanism to jointly use different 
representation subspaces of queries, keys, and values. 


To this end, instead of performing a single attention pooling, queries, keys, and values can 
be transformed with h independently learned linear projections. Then these h projected 
queries, keys, and values are fed into attention pooling in parallel. In the end, h attention- 
pooling outputs are concatenated and transformed with another learned linear projection to 
produce the final output. This design is called multi-head attention, where each of the h 
attention pooling outputs is a head (Vaswani et al., 2017). Using fully connected layers to 
perform learnable linear transformations, Fig. 11.5.1 describes multi-head attention. 


import math
import torch
from torch import nn
from d2l import torch as d2l


Multi-head attention, where multiple heads are concatenated then linearly transformed.


11.5.1 Model 


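Before providing the implementation of multi-head attention, let’s formalize this model
mathematically. Given a query $\mathbf{q} \in \mathbb{R}^{d_q}$, a key $\mathbf{k} \in \mathbb{R}^{d_k}$, and a value $\mathbf{v} \in \mathbb{R}^{d_v}$, each
attention head $\mathbf{h}_i$ ($i = 1, \ldots, h$) is computed as


$$\mathbf{h}_i = f(\mathbf{W}_i^{(q)} \mathbf{q}, \mathbf{W}_i^{(k)} \mathbf{k}, \mathbf{W}_i^{(v)} \mathbf{v}) \in \mathbb{R}^{p_v}, \tag{11.5.1}$$


where $\mathbf{W}_i^{(q)} \in \mathbb{R}^{p_q \times d_q}$, $\mathbf{W}_i^{(k)} \in \mathbb{R}^{p_k \times d_k}$, and $\mathbf{W}_i^{(v)} \in \mathbb{R}^{p_v \times d_v}$ are learnable parameters
and $f$ is attention pooling, such as additive attention and scaled dot product attention in
Section 11.3. The multi-head attention output is another linear transformation via learnable
parameters $\mathbf{W}_o \in \mathbb{R}^{p_o \times h p_v}$ of the concatenation of $h$ heads:


$$\mathbf{W}_o \begin{bmatrix} \mathbf{h}_1 \\ \vdots \\ \mathbf{h}_h \end{bmatrix} \in \mathbb{R}^{p_o}. \tag{11.5.2}$$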

Based on this design, each head may attend to different parts of the input. More sophisti- 
cated functions than the simple weighted average can be expressed. 


11.5.2 Implementation 


In our implementation, we choose the scaled dot product attention for each head of the
multi-head attention. To avoid significant growth of computational cost and parametrization
cost, we set $p_q = p_k = p_v = p_o / h$. Note that $h$ heads can be computed in parallel
if we set the number of outputs of linear transformations for the query, key, and value to
$p_q h = p_k h = p_v h = p_o$. In the following implementation, $p_o$ is specified via the argument
num_hiddens.


class MultiHeadAttention(d2l.Module):  #@save
    """Multi-head attention."""
    def __init__(self, num_hiddens, num_heads, dropout, bias=False, **kwargs):
        super().__init__()
        self.num_heads = num_heads
        self.attention = d2l.DotProductAttention(dropout)
        self.W_q = nn.LazyLinear(num_hiddens, bias=bias)
        self.W_k = nn.LazyLinear(num_hiddens, bias=bias)
        self.W_v = nn.LazyLinear(num_hiddens, bias=bias)
        self.W_o = nn.LazyLinear(num_hiddens, bias=bias)

    def forward(self, queries, keys, values, valid_lens):
        # Shape of queries, keys, or values:
        # (batch_size, no. of queries or key-value pairs, num_hiddens)
        # Shape of valid_lens: (batch_size,) or (batch_size, no. of queries)
        # After transposing, shape of output queries, keys, or values:
        # (batch_size * num_heads, no. of queries or key-value pairs,
        # num_hiddens / num_heads)
        queries = self.transpose_qkv(self.W_q(queries))
        keys = self.transpose_qkv(self.W_k(keys))
        values = self.transpose_qkv(self.W_v(values))

        if valid_lens is not None:
            # On axis 0, copy the first item (scalar or vector) for num_heads
            # times, then copy the next item, and so on
            valid_lens = torch.repeat_interleave(
                valid_lens, repeats=self.num_heads, dim=0)

        # Shape of output: (batch_size * num_heads, no. of queries,
        # num_hiddens / num_heads)
        output = self.attention(queries, keys, values, valid_lens)
        # Shape of output_concat: (batch_size, no. of queries, num_hiddens)
        output_concat = self.transpose_output(output)
        return self.W_o(output_concat)
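The choice $p_q = p_k = p_v = p_o / h$ keeps the number of learnable weights independent of the number of heads. The short sketch below checks this by counting weights directly from the shapes in (11.5.1) and (11.5.2); the function name `multihead_param_count` is a hypothetical helper introduced here, and biases are assumed to be absent (as with `bias=False`).

```python
def multihead_param_count(d_q, d_k, d_v, p_o, num_heads):
    """Count weights of h heads plus the output projection (no biases)."""
    if p_o % num_heads != 0:
        raise ValueError("p_o must be divisible by num_heads")
    p = p_o // num_heads  # p_q = p_k = p_v = p_o / h
    # Each head owns W^(q): p x d_q, W^(k): p x d_k, W^(v): p x d_v
    per_head = p * d_q + p * d_k + p * d_v
    # W_o has shape p_o x (h * p_v)
    output_proj = p_o * (num_heads * p)
    return num_heads * per_head + output_proj

counts = {h: multihead_param_count(100, 100, 100, 100, h)
          for h in (1, 2, 5, 10)}
print(counts)
```

Every entry is the same because the per-head projections shrink in proportion to the number of heads, so adding heads changes how the representation is partitioned rather than how many weights are learned.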


To allow for parallel computation of multiple heads, the above MultiHeadAttention class 
uses two transposition methods as defined below. Specifically, the transpose_output 
method reverses the operation of the transpose_qkv method. 


@d2l.add_to_class(MultiHeadAttention)  #@save
def transpose_qkv(self, X):
    """Transposition for parallel computation of multiple attention heads."""
    # Shape of input X: (batch_size, no. of queries or key-value pairs,
    # num_hiddens). Shape of output X: (batch_size, no. of queries or
    # key-value pairs, num_heads, num_hiddens / num_heads)
    X = X.reshape(X.shape[0], X.shape[1], self.num_heads, -1)
    # Shape of output X: (batch_size, num_heads, no. of queries or key-value
    # pairs, num_hiddens / num_heads)
    X = X.permute(0, 2, 1, 3)
    # Shape of output: (batch_size * num_heads, no. of queries or key-value
    # pairs, num_hiddens / num_heads)
    return X.reshape(-1, X.shape[2], X.shape[3])


@d2l.add_to_class(MultiHeadAttention)  #@save
def transpose_output(self, X):
    """Reverse the operation of transpose_qkv."""
    X = X.reshape(-1, self.num_heads, X.shape[1], X.shape[2])
    X = X.permute(0, 2, 1, 3)
    return X.reshape(X.shape[0], X.shape[1], -1)
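The two transpositions can be traced on a small example without any tensor library. The sketch below re-implements them on nested Python lists; the standalone functions take `num_heads` as an explicit argument, which is an illustrative simplification and not how the class methods are written.

```python
def transpose_qkv(X, num_heads):
    """(batch, n, d) nested lists -> (batch * num_heads, n, d // num_heads)."""
    out = []
    for sample in X:
        d = len(sample[0])
        if d % num_heads != 0:
            raise ValueError("feature size must be divisible by num_heads")
        size = d // num_heads
        # Head `head` of this sample receives feature slice
        # [head * size, (head + 1) * size) of every row
        for head in range(num_heads):
            out.append([row[head * size:(head + 1) * size] for row in sample])
    return out

def transpose_output(X, num_heads):
    """Inverse of transpose_qkv: concatenate the heads of each sample."""
    out = []
    for b in range(0, len(X), num_heads):
        heads = X[b:b + num_heads]
        n = len(heads[0])
        out.append([[v for head in heads for v in head[i]] for i in range(n)])
    return out

# Toy input: batch of 2, 3 positions, 4 features; entry = 100*b + 10*i + j
X = [[[100 * b + 10 * i + j for j in range(4)] for i in range(3)]
     for b in range(2)]
split = transpose_qkv(X, num_heads=2)
restored = transpose_output(split, num_heads=2)
print(len(split), split[1][0], restored == X)
```

The heads of one sample end up adjacent along the first axis, which is why the valid lengths are repeated with `repeat_interleave` rather than tiled.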


Let’s test our implemented MultiHeadAttention class using a toy example where keys
and values are the same. As a result, the shape of the multi-head attention output is
(batch_size, num_queries, num_hiddens).


num_hiddens, num_heads = 100, 5
attention = MultiHeadAttention(num_hiddens, num_heads, 0.5)
batch_size, num_queries, num_kvpairs = 2, 4, 6
valid_lens = torch.tensor([3, 2])
X = torch.ones((batch_size, num_queries, num_hiddens))
Y = torch.ones((batch_size, num_kvpairs, num_hiddens))
d2l.check_shape(attention(X, Y, Y, valid_lens),
                (batch_size, num_queries, num_hiddens))


11.5.3 Summary 


Multi-head attention combines knowledge of the same attention pooling via different repre- 
sentation subspaces of queries, keys, and values. To compute multiple heads of multi-head 
attention in parallel, proper tensor manipulation is needed. 


11.5.4 Exercises 


1. Visualize attention weights of multiple heads in this experiment. 


2. Suppose that we have a trained model based on multi-head attention and we want to 
prune less important attention heads to increase the prediction speed. How can we de- 
sign experiments to measure the importance of an attention head? 


Discussions


11.6 Self-Attention and Positional Encoding 


In deep learning, we often use CNNs or RNNs to encode sequences. Now with attention
mechanisms in mind, imagine feeding a sequence of tokens into an attention mechanism
such that at every step, each token has its own query, keys, and values. Here, when computing
the value of a token’s representation at the next layer, the token can attend (via its query
vector) to any other token (matching based on their key vectors). Using the full set of
query-key compatibility scores, we can compute, for each token, a representation by building
the appropriate weighted sum over the other tokens. Because every token is attending
to each other token (unlike the case where decoder steps attend to encoder steps), such
architectures are typically described as self-attention models (Lin et al., 2017, Vaswani et
al., 2017), and elsewhere described as intra-attention models (Cheng et al., 2016, Parikh
et al., 2016, Paulus et al., 2017). In this section, we will discuss sequence encoding using
self-attention, including using additional information for the sequence order.


import math
import torch
from torch import nn
from d2l import torch as d2l


11.6.1 Self-Attention 


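Given a sequence of input tokens $\mathbf{x}_1, \ldots, \mathbf{x}_n$ where any $\mathbf{x}_i \in \mathbb{R}^d$ ($1 \leq i \leq n$), its
self-attention outputs a sequence of the same length $\mathbf{y}_1, \ldots, \mathbf{y}_n$, where

$$\mathbf{y}_i = f(\mathbf{x}_i, (\mathbf{x}_1, \mathbf{x}_1), \ldots, (\mathbf{x}_n, \mathbf{x}_n)) \in \mathbb{R}^d \tag{11.6.1}$$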


according to the definition of attention pooling in (11.1.1). Using multi-head attention, 
the following code snippet computes the self-attention of a tensor with shape (batch size, 
number of time steps or sequence length in tokens, d). The output tensor has the same 
shape. 


num_hiddens, num_heads = 100, 5
attention = d2l.MultiHeadAttention(num_hiddens, num_heads, 0.5)
batch_size, num_queries, valid_lens = 2, 4, torch.tensor([3, 2])
X = torch.ones((batch_size, num_queries, num_hiddens))
d2l.check_shape(attention(X, X, X, valid_lens),
                (batch_size, num_queries, num_hiddens))


11.6.2 Comparing CNNs, RNNs, and Self-Attention 


Let’s compare architectures for mapping a sequence of n tokens to another one of equal 
length, where each input or output token is represented by a d-dimensional vector. Specif- 
ically, we will consider CNNs, RNNs, and self-attention. We will compare their computa- 
tional complexity, sequential operations, and maximum path lengths. Note that sequential 
operations prevent parallel computation, while a shorter path between any combination of 
sequence positions makes it easier to learn long-range dependencies within the sequence 
(Hochreiter et al., 2001). 


Let’s regard any text sequence as a “one-dimensional image”. Similarly, one-dimensional
CNNs can process local features such as $n$-grams in text. Given a sequence of length $n$,
consider a convolutional layer whose kernel size is $k$, and whose numbers of input and output
channels are both $d$. The computational complexity of the convolutional layer is $\mathcal{O}(knd^2)$.
As Fig. 11.6.1 shows, CNNs are hierarchical, so there are $\mathcal{O}(1)$ sequential operations and
the maximum path length is $\mathcal{O}(n/k)$. For example, $\mathbf{x}_1$ and $\mathbf{x}_5$ are within the receptive field
of a two-layer CNN with kernel size 3 in Fig. 11.6.1.


When updating the hidden state of RNNs, multiplication of the $d \times d$ weight matrix and the
$d$-dimensional hidden state has a computational complexity of $\mathcal{O}(d^2)$. Since the sequence
length is $n$, the computational complexity of the recurrent layer is $\mathcal{O}(nd^2)$. According
to Fig. 11.6.1, there are $\mathcal{O}(n)$ sequential operations that cannot be parallelized and the
maximum path length is also $\mathcal{O}(n)$.


Comparing CNN (padding tokens are omitted), RNN, and self-attention architectures.


In self-attention, the queries, keys, and values are all $n \times d$ matrices. Consider the scaled
dot product attention in (11.3.6), where an $n \times d$ matrix is multiplied by a $d \times n$ matrix, then
the output $n \times n$ matrix is multiplied by an $n \times d$ matrix. As a result, the self-attention has a
$\mathcal{O}(n^2 d)$ computational complexity. As we can see from Fig. 11.6.1, each token is directly
connected to any other token via self-attention. Therefore, computation can be parallel with
$\mathcal{O}(1)$ sequential operations and the maximum path length is also $\mathcal{O}(1)$.


All in all, both CNNs and self-attention enjoy parallel computation and self-attention has 
the shortest maximum path length. However, the quadratic computational complexity with 
respect to the sequence length makes self-attention prohibitively slow for very long se- 
quences. 
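To get a feel for when the quadratic term dominates, the following sketch evaluates the leading-order operation counts quoted above for a short and a long sequence. The function `layer_op_counts` is a hypothetical helper; the counts ignore constant factors and are only meant for relative comparison.

```python
def layer_op_counts(n, d, k=3):
    """Leading-order multiply counts for one layer (constants ignored)."""
    return {
        "cnn": k * n * d * d,         # O(k n d^2)
        "rnn": n * d * d,             # O(n d^2)
        "self_attention": n * n * d,  # O(n^2 d)
    }

short = layer_op_counts(n=64, d=512)
long = layer_op_counts(n=4096, d=512)
for name in ("cnn", "rnn", "self_attention"):
    print(f"{name:>15}: n=64 -> {short[name]:>13,}  n=4096 -> {long[name]:>13,}")
```

In this simplified count, self-attention is cheaper than a recurrent layer whenever $n < d$ and more expensive once the sequence length exceeds the feature dimension.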


11.6.3 Positional Encoding 


Unlike RNNs, which recurrently process tokens of a sequence one-by-one, self-attention 
ditches sequential operations in favor of parallel computation. Note that self-attention by 
itself does not preserve the order of the sequence. What do we do if it really matters that 
the model knows in which order the input sequence arrived? 
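The claim that self-attention by itself does not preserve order can be checked directly: permuting the input tokens merely permutes the outputs in the same way. The sketch below uses a minimal scaled dot product self-attention written in plain Python (queries, keys, and values all equal to the input, with no learned projections), which is a simplifying assumption made for illustration.

```python
import math

def self_attention(X):
    """Scaled dot product self-attention with queries = keys = values = X."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
perm = [2, 0, 1]  # a reordering of the three tokens
Y = self_attention(tokens)
Y_perm = self_attention([tokens[i] for i in perm])

# Largest deviation between the permuted output and the output of the
# permuted input; it is zero up to floating-point rounding.
max_diff = max(abs(a - b)
               for r, i in enumerate(perm)
               for a, b in zip(Y_perm[r], Y[i]))
print(max_diff)
```

Since each token's output is the same wherever it appears in the sequence, any order information has to be supplied through the inputs themselves.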


The dominant approach for preserving information about the order of tokens is to represent 
this to the model as an additional input associated with each token. These inputs are called 
positional encodings, and they can either be learned or fixed a priori. We now describe a 
simple scheme for fixed positional encodings based on sine and cosine functions (Vaswani 
et al., 2017). 


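Suppose that the input representation $\mathbf{X} \in \mathbb{R}^{n \times d}$ contains the $d$-dimensional embeddings
for $n$ tokens of a sequence. The positional encoding outputs $\mathbf{X} + \mathbf{P}$ using a positional
embedding matrix $\mathbf{P} \in \mathbb{R}^{n \times d}$ of the same shape, whose element on the $i^\mathrm{th}$ row and the
$(2j)^\mathrm{th}$ or the $(2j + 1)^\mathrm{th}$ column is


$$\begin{aligned}
p_{i, 2j} &= \sin\left(\frac{i}{10000^{2j/d}}\right), \\
p_{i, 2j+1} &= \cos\left(\frac{i}{10000^{2j/d}}\right).
\end{aligned} \tag{11.6.2}$$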

At first glance, this trigonometric function design looks weird. Before we give explanations 
of this design, let’s first implement it in the following PositionalEncoding class. 


class PositionalEncoding(nn.Module):  #@save
    """Positional encoding."""
    def __init__(self, num_hiddens, dropout, max_len=1000):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        # Create a long enough P
        self.P = torch.zeros((1, max_len, num_hiddens))
        X = torch.arange(max_len, dtype=torch.float32).reshape(
            -1, 1) / torch.pow(10000, torch.arange(
            0, num_hiddens, 2, dtype=torch.float32) / num_hiddens)
        self.P[:, :, 0::2] = torch.sin(X)
        self.P[:, :, 1::2] = torch.cos(X)

    def forward(self, X):
        X = X + self.P[:, :X.shape[1], :].to(X.device)
        return self.dropout(X)


In the positional embedding matrix $\mathbf{P}$, rows correspond to positions within a sequence and
columns represent different positional encoding dimensions. In the example below, we
can see that the 6th and the 7th columns of the positional embedding matrix have a higher
frequency than the 8th and the 9th columns. The offset between the 6th and the 7th (same for
the 8th and the 9th) columns is due to the alternation of sine and cosine functions.


encoding_dim, num_steps = 32, 60
pos_encoding = PositionalEncoding(encoding_dim, 0)
X = pos_encoding(torch.zeros((1, num_steps, encoding_dim)))
P = pos_encoding.P[:, :X.shape[1], :]
d2l.plot(torch.arange(num_steps), P[0, :, 6:10].T, xlabel='Row (position)',
         figsize=(6, 2.5), legend=["Col %d" % d for d in torch.arange(6, 10)])


Absolute Positional Information


To see how the monotonically decreased frequency along the encoding dimension relates 
to absolute positional information, let’s print out the binary representations of 0,1,...,7. 
As we can see, the lowest bit, the second-lowest bit, and the third-lowest bit alternate on 
every number, every two numbers, and every four numbers, respectively. 


for i in range(8):
    print(f'{i} in binary is {i:>03b}')


0 in binary is 000
1 in binary is 001
2 in binary is 010
3 in binary is 011
4 in binary is 100
5 in binary is 101
6 in binary is 110
7 in binary is 111


In binary representations, a higher bit has a lower frequency than a lower bit. Similarly, 
as demonstrated in the heat map below, the positional encoding decreases frequencies 
along the encoding dimension by using trigonometric functions. Since the outputs are float 
numbers, such continuous representations are more space-efficient than binary representa- 
tions. 


P = P[0, :, :].unsqueeze(0).unsqueeze(0)
d2l.show_heatmaps(P, xlabel='Column (encoding dimension)',
                  ylabel='Row (position)', figsize=(3.5, 4), cmap='Blues')


Relative Positional Information


Besides capturing absolute positional information, the above positional encoding also
allows a model to easily learn to attend by relative positions. This is because for any fixed
position offset $\delta$, the positional encoding at position $i + \delta$ can be represented by a linear
projection of that at position $i$.


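This projection can be explained mathematically. Denoting $\omega_j = 1/10000^{2j/d}$, any pair
of $(p_{i, 2j}, p_{i, 2j+1})$ in (11.6.2) can be linearly projected to $(p_{i+\delta, 2j}, p_{i+\delta, 2j+1})$ for any fixed
offset $\delta$:

$$\begin{aligned}
&\begin{bmatrix} \cos(\delta \omega_j) & \sin(\delta \omega_j) \\ -\sin(\delta \omega_j) & \cos(\delta \omega_j) \end{bmatrix}
\begin{bmatrix} p_{i, 2j} \\ p_{i, 2j+1} \end{bmatrix} \\
=&\begin{bmatrix} \cos(\delta \omega_j) \sin(i \omega_j) + \sin(\delta \omega_j) \cos(i \omega_j) \\ -\sin(\delta \omega_j) \sin(i \omega_j) + \cos(\delta \omega_j) \cos(i \omega_j) \end{bmatrix} \\
=&\begin{bmatrix} \sin\left((i + \delta) \omega_j\right) \\ \cos\left((i + \delta) \omega_j\right) \end{bmatrix} \\
=&\begin{bmatrix} p_{i+\delta, 2j} \\ p_{i+\delta, 2j+1} \end{bmatrix},
\end{aligned} \tag{11.6.3}$$

where the $2 \times 2$ projection matrix does not depend on any position index $i$.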
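The identity in (11.6.3) is easy to verify numerically. The sketch below builds the sine and cosine pair for a position, applies the $2 \times 2$ matrix for a fixed offset, and compares the result with the pair computed directly at the shifted position. The helper names `positional_pair` and `shift_pair` and the particular values of `d`, `j`, and `delta` are arbitrary choices made for illustration.

```python
import math

def positional_pair(i, j, d):
    """Return (p_{i,2j}, p_{i,2j+1}) as defined in (11.6.2)."""
    omega = 1.0 / (10000 ** (2 * j / d))
    return math.sin(i * omega), math.cos(i * omega)

def shift_pair(pair, delta, j, d):
    """Apply the 2 x 2 projection of (11.6.3) for offset delta."""
    omega = 1.0 / (10000 ** (2 * j / d))
    c, s = math.cos(delta * omega), math.sin(delta * omega)
    p_sin, p_cos = pair
    return c * p_sin + s * p_cos, -s * p_sin + c * p_cos

d, j, delta = 32, 3, 5
errors = []
for i in range(60):
    projected = shift_pair(positional_pair(i, j, d), delta, j, d)
    direct = positional_pair(i + delta, j, d)
    errors.append(max(abs(a - b) for a, b in zip(projected, direct)))
print(max(errors))  # tiny: only floating-point rounding remains
```

The same matrix is reused for every position `i` in the loop, which is exactly the position independence stated above.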


11.6.4 Summary 


In self-attention, the queries, keys, and values all come from the same place. Both CNNs 
and self-attention enjoy parallel computation and self-attention has the shortest maximum 
path length. However, the quadratic computational complexity with respect to the sequence 
length makes self-attention prohibitively slow for very long sequences. To use the sequence 
order information, we can inject absolute or relative positional information by adding po- 
sitional encoding to the input representations. 


11.6.5 Exercises 


1. Suppose that we design a deep architecture to represent a sequence by stacking self- 
attention layers with positional encoding. What could the possible issues be? 


2. Can you design a learnable positional encoding method? 


3. Can we assign different learned embeddings according to different offsets between queries 
and keys that are compared in self-attention? Hint: you may refer to relative position 
embeddings (Huang et al., 2018, Shaw et al., 2018). 


Discussions


11.7 The Transformer Architecture 


We have compared CNNs, RNNs, and self-attention in Section 11.6.2. Notably, self-attention
enjoys both parallel computation and the shortest maximum path length. Therefore,
it is appealing to design deep architectures by using self-attention. Unlike earlier
self-attention models that still rely on RNNs for input representations (Cheng et al., 2016,
Lin et al., 2017, Paulus et al., 2017), the Transformer model is solely based on attention
mechanisms without any convolutional or recurrent layer (Vaswani et al., 2017). Though
originally proposed for sequence-to-sequence learning on text data, Transformers have become
pervasive in a wide range of modern deep learning applications, spanning
language, vision, speech, and reinforcement learning.


import math
import pandas as pd
import torch
from torch import nn
from d2l import torch as d2l


11.7.1 Model 


As an instance of the encoder—decoder architecture, the overall architecture of the Trans- 
former is presented in Fig. 11.7.1. As we can see, the Transformer is composed of an en- 
coder and a decoder. In contrast to Bahdanau attention for sequence-to-sequence learning 
in Fig. 11.4.2, the input (source) and output (target) sequence embeddings are added with 
positional encoding before being fed into the encoder and the decoder that stack modules 
based on self-attention. 


Now we provide an overview of the Transformer architecture in Fig. 11.7.1. At a high level,
the Transformer encoder is a stack of multiple identical layers, where each layer has two
sublayers (either is denoted as $\mathrm{sublayer}$). The first is a multi-head self-attention pooling
and the second is a positionwise feed-forward network. Specifically, in the encoder
self-attention, queries, keys, and values are all from the outputs of the previous encoder layer.
Inspired by the ResNet design of Section 8.6, a residual connection is employed around
both sublayers. In the Transformer, for any input $\mathbf{x} \in \mathbb{R}^d$ at any position of the sequence,
we require that $\mathrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d$ so that the residual connection $\mathbf{x} + \mathrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d$ is
feasible. This addition from the residual connection is immediately followed by layer
normalization (Ba et al., 2016). As a result, the Transformer encoder outputs a $d$-dimensional
vector representation for each position of the input sequence.


The Transformer decoder is also a stack of multiple identical layers with residual connec- 
tions and layer normalizations. As well as the two sublayers described in the encoder, the 
decoder inserts a third sublayer, known as the encoder—decoder attention, between these 
two. In the encoder—decoder attention, queries are from the outputs of the decoder’s self- 
attention sublayer, and the keys and values are from the Transformer encoder outputs. In 
the decoder self-attention, queries, keys, and values are all from the outputs of the previous 
decoder layer. However, each position in the decoder is allowed only to attend to all posi- 
tions in the decoder up to that position. This masked attention preserves the autoregressive 
property, ensuring that the prediction only depends on those output tokens that have been 
generated. 
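One way to picture the masked decoder self-attention is as a softmax in which query position $i$ only sees key positions up to and including $i$. The plain-Python sketch below uses a hypothetical `causal_attention_weights` helper on a toy score matrix; it illustrates the masking pattern only and is not the Transformer decoder implementation.

```python
import math

def causal_attention_weights(scores):
    """Row i: softmax over scores[i][:i + 1]; later positions get weight 0."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[:i + 1]  # positions 0..i are already generated
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps]
                       + [0.0] * (len(row) - i - 1))
    return weights

# Uniform scores make the pattern easy to read: row i is 1/(i+1) then zeros
scores = [[0.0] * 4 for _ in range(4)]
W = causal_attention_weights(scores)
for row in W:
    print([round(w, 3) for w in row])
```

The resulting matrix is lower triangular, so the representation at a position never depends on tokens that come after it.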


We have already described and implemented multi-head attention based on scaled dot
products in Section 11.5 and positional encoding in Section 11.6.3. In the following, we will
implement the rest of the Transformer model.


The Transformer architecture.


11.7.2 Positionwise Feed-Forward Networks 


The positionwise feed-forward network transforms the representation at all the sequence 
positions using the same MLP. This is why we call it positionwise. In the implementation 
below, the input X with shape (batch size, number of time steps or sequence length in tokens, 
number of hidden units or feature dimension) will be transformed by a two-layer MLP into 
an output tensor of shape (batch size, number of time steps, ffn_num_outputs). 


class PositionWiseFFN(nn.Module):  #@save
    """The positionwise feed-forward network."""
    def __init__(self, ffn_num_hiddens, ffn_num_outputs):
        super().__init__()
        self.dense1 = nn.LazyLinear(ffn_num_hiddens)
        self.relu = nn.ReLU()
        self.dense2 = nn.LazyLinear(ffn_num_outputs)

    def forward(self, X):
        return self.dense2(self.relu(self.dense1(X)))


The following example shows that the innermost dimension of a tensor changes to the
number of outputs in the positionwise feed-forward network. Since the same MLP transforms
at all the positions, when the inputs at all these positions are the same, their outputs are also
identical.


ffn = PositionWiseFFN(4, 8) 
ffn.eval() 
ffn(torch.ones((2, 3, 4)))[0] 


tensor([[ 0.6300,  0.7739,  0.0278,  0.2508, -0.0519,  0.4881, -0.4105,  0.5163],
        [ 0.6300,  0.7739,  0.0278,  0.2508, -0.0519,  0.4881, -0.4105,  0.5163],
        [ 0.6300,  0.7739,  0.0278,  0.2508, -0.0519,  0.4881, -0.4105,  0.5163]],
       grad_fn=<SelectBackward0>)


11.7.3 Residual Connection and Layer Normalization 


Now let’s focus on the “add & norm” component in Fig. 11.7.1. As we described at the 
beginning of this section, this is a residual connection immediately followed by layer nor- 
malization. Both are key to effective deep architectures. 


In Section 8.5, we explained how batch normalization recenters and rescales across the 
examples within a minibatch. As discussed in Section 8.5.2, layer normalization is the same 
as batch normalization except that the former normalizes across the feature dimension, thus 
enjoying benefits of scale independence and batch size independence. Despite its pervasive 
applications in computer vision, batch normalization is usually empirically less effective 
than layer normalization in natural language processing tasks, where the inputs are often 
variable-length sequences. 


The following code snippet compares the normalization across different dimensions by 
layer normalization and batch normalization. 


ln = nn.LayerNorm(2)
bn = nn.LazyBatchNorm1d()
X = torch.tensor([[1, 2], [2, 3]], dtype=torch.float32)
# Compute mean and variance from X in the training mode
print('layer norm:', ln(X), '\nbatch norm:', bn(X))


layer norm: tensor([[-1.0000,  1.0000],
        [-1.0000,  1.0000]], grad_fn=<NativeLayerNormBackward0>)
batch norm: tensor([[-1.0000, -1.0000],
        [ 1.0000,  1.0000]], grad_fn=<NativeBatchNormBackward0>)
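
The same numbers can be reproduced by hand, which makes the difference in normalization axis explicit. The sketch below is only an illustration: the helper normalize is a hypothetical name, and it ignores the learnable scale and shift, which are initialized to 1 and 0 and therefore do not affect the result here.

```python
import torch

X = torch.tensor([[1, 2], [2, 3]], dtype=torch.float32)
eps = 1e-5  # default epsilon of nn.LayerNorm and nn.BatchNorm1d

def normalize(X, dim):
    # Standardize along `dim` using the biased (population) variance
    mean = X.mean(dim=dim, keepdim=True)
    var = X.var(dim=dim, unbiased=False, keepdim=True)
    return (X - mean) / torch.sqrt(var + eps)

ln_manual = normalize(X, dim=1)  # per example, across the features
bn_manual = normalize(X, dim=0)  # per feature, across the minibatch
print('layer norm:', ln_manual, '\nbatch norm:', bn_manual)
```

Layer normalization standardizes each row, so every example becomes roughly (-1, 1), whereas batch normalization standardizes each column, so every feature becomes roughly (-1, 1) across the two examples.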


Now we can implement the AddNorm class using a residual connection followed by layer
normalization. Dropout is also applied for regularization.


class AddNorm(nn.Module):  #@save
    """The residual connection followed by layer normalization."""
    def __init__(self, norm_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(norm_shape)

    def forward(self, X, Y):
        return self.ln(self.dropout(Y) + X)


The residual connection requires that the two inputs are of the same shape so that the output 
tensor also has the same shape after the addition operation. 


add_norm = AddNorm(4, 0.5)
shape = (2, 3, 4)
d2l.check_shape(add_norm(torch.ones(shape), torch.ones(shape)), shape)


11.7.4 Encoder 


With all the essential components to assemble the Transformer encoder, let’s start by im- 
plementing a single layer within the encoder. The following TransformerEncoderBlock 
class contains two sublayers: multi-head self-attention and positionwise feed-forward net- 
works, where a residual connection followed by layer normalization is employed around 
both sublayers. 


class TransformerEncoderBlock(nn.Module):  #@save
    """The Transformer encoder block."""
    def __init__(self, num_hiddens, ffn_num_hiddens, num_heads, dropout,
                 use_bias=False):
        super().__init__()
        self.attention = d2l.MultiHeadAttention(num_hiddens, num_heads,
                                                dropout, use_bias)
        self.addnorm1 = AddNorm(num_hiddens, dropout)
        self.ffn = PositionWiseFFN(ffn_num_hiddens, num_hiddens)
        self.addnorm2 = AddNorm(num_hiddens, dropout)

    def forward(self, X, valid_lens):
        Y = self.addnorm1(X, self.attention(X, X, X, valid_lens))
        return self.addnorm2(Y, self.ffn(Y))


As we can see, no layer in the Transformer encoder changes the shape of its input. 


X = torch.ones((2, 100, 24))
valid_lens = torch.tensor([3, 2])
encoder_blk = TransformerEncoderBlock(24, 48, 8, 0.5)
encoder_blk.eval()
d2l.check_shape(encoder_blk(X, valid_lens), X.shape)


In the following Transformer encoder implementation, we stack num_blks instances of the
above TransformerEncoderBlock class. Since we use the fixed positional encoding
whose values are always between -1 and 1, we multiply the values of the learnable input
embeddings by the square root of the embedding dimension to rescale them before summing
the input embedding and the positional encoding.


class TransformerEncoder(d2l.Encoder):  #@save
    """The Transformer encoder."""
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens,
                 num_heads, num_blks, dropout, use_bias=False):
        super().__init__()
        self.num_hiddens = num_hiddens
        self.embedding = nn.Embedding(vocab_size, num_hiddens)
        self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout)
        self.blks = nn.Sequential()
        for i in range(num_blks):
            self.blks.add_module("block"+str(i), TransformerEncoderBlock(
                num_hiddens, ffn_num_hiddens, num_heads, dropout, use_bias))

    def forward(self, X, valid_lens):
        # Since positional encoding values are between -1 and 1, the embedding
        # values are multiplied by the square root of the embedding dimension
        # to rescale before they are summed up
        X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens))
        self.attention_weights = [None] * len(self.blks)
        for i, blk in enumerate(self.blks):
            X = blk(X, valid_lens)
            self.attention_weights[
                i] = blk.attention.attention.attention_weights
        return X


Below we specify hyperparameters to create a two-layer Transformer encoder. The shape of 
the Transformer encoder output is (batch size, number of time steps, num_hiddens). 


encoder = TransformerEncoder(200, 24, 48, 8, 2, 0.5)
d2l.check_shape(encoder(torch.ones((2, 100), dtype=torch.long), valid_lens),
                (2, 100, 24))
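
To get a feel for the rescaling step, the sketch below compares the range of a sinusoidal positional encoding with the spread of embedding values before and after multiplying by the square root of the embedding dimension. It builds the encoding directly instead of calling d2l.PositionalEncoding, and the vocabulary size and sequence length are arbitrary assumptions.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
num_hiddens, num_steps = 24, 100

# Sinusoidal positional encoding: even columns use sin, odd columns use cos
pos = torch.arange(num_steps, dtype=torch.float32).reshape(-1, 1)
freq = torch.pow(10000, torch.arange(0, num_hiddens, 2,
                                     dtype=torch.float32) / num_hiddens)
P = torch.zeros(num_steps, num_hiddens)
P[:, 0::2] = torch.sin(pos / freq)
P[:, 1::2] = torch.cos(pos / freq)

emb = nn.Embedding(200, num_hiddens)
tokens = torch.randint(0, 200, (2, num_steps))
E = emb(tokens)
E_scaled = E * math.sqrt(num_hiddens)

print('max |P|:', P.abs().max().item())
print('embedding std before and after scaling:',
      E.std().item(), E_scaled.std().item())
```

The positional encoding never leaves the interval from -1 to 1, while scaling multiplies the spread of the embedding values by the square root of num_hiddens, so the token identity is not drowned out by the positional signal when the two are added.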


11.7.5 Decoder 


As shown in Fig. 11.7.1, the Transformer decoder is composed of multiple identical layers.
Each layer is implemented in the following TransformerDecoderBlock class, which
contains three sublayers: decoder self-attention, encoder-decoder attention, and
positionwise feed-forward networks. These sublayers employ a residual connection around them
followed by layer normalization.


As we described earlier in this section, in the masked multi-head decoder self-attention 
(the first sublayer), queries, keys, and values all come from the outputs of the previous 
decoder layer. When training sequence-to-sequence models, tokens at all the positions 
(time steps) of the output sequence are known. However, during prediction the output 
sequence is generated token by token; thus, at any decoder time step only the generated 
tokens can be used in the decoder self-attention. To preserve autoregression in the decoder,
its masked self-attention specifies dec_valid_lens so that any query only attends to all
positions in the decoder up to the query position.


class TransformerDecoderBlock(nn.Module):
    # The i-th block in the Transformer decoder
    def __init__(self, num_hiddens, ffn_num_hiddens, num_heads, dropout, i):
        super().__init__()
        self.i = i
        self.attention1 = d2l.MultiHeadAttention(num_hiddens, num_heads,
                                                 dropout)
        self.addnorm1 = AddNorm(num_hiddens, dropout)
        self.attention2 = d2l.MultiHeadAttention(num_hiddens, num_heads,
                                                 dropout)
        self.addnorm2 = AddNorm(num_hiddens, dropout)
        self.ffn = PositionWiseFFN(ffn_num_hiddens, num_hiddens)
        self.addnorm3 = AddNorm(num_hiddens, dropout)

    def forward(self, X, state):
        enc_outputs, enc_valid_lens = state[0], state[1]
        # During training, all the tokens of any output sequence are processed
        # at the same time, so state[2][self.i] is None as initialized. When
        # decoding any output sequence token by token during prediction,
        # state[2][self.i] contains representations of the decoded output at
        # the i-th block up to the current time step
        if state[2][self.i] is None:
            key_values = X
        else:
            key_values = torch.cat((state[2][self.i], X), dim=1)
        state[2][self.i] = key_values
        if self.training:
            batch_size, num_steps, _ = X.shape
            # Shape of dec_valid_lens: (batch_size, num_steps), where every
            # row is [1, 2, ..., num_steps]
            dec_valid_lens = torch.arange(
                1, num_steps + 1, device=X.device).repeat(batch_size, 1)
        else:
            dec_valid_lens = None
        # Self-attention
        X2 = self.attention1(X, key_values, key_values, dec_valid_lens)
        Y = self.addnorm1(X, X2)
        # Encoder-decoder attention. Shape of enc_outputs:
        # (batch_size, num_steps, num_hiddens)
        Y2 = self.attention2(Y, enc_outputs, enc_outputs, enc_valid_lens)
        Z = self.addnorm2(Y, Y2)
        return self.addnorm3(Z, self.ffn(Z)), state
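
The effect of dec_valid_lens can be checked in isolation. The following sketch builds the same (batch_size, num_steps) tensor of valid lengths and turns it into an explicit boolean mask; filling masked scores with -1e6 before the softmax is an assumption about how the masking is realized, chosen here only for illustration.

```python
import torch

torch.manual_seed(0)
batch_size, num_steps = 2, 4
# Same construction as in the decoder block: every row is [1, 2, ..., num_steps]
dec_valid_lens = torch.arange(1, num_steps + 1).repeat(batch_size, 1)

# The query at position t may attend to key positions 0, ..., t
key_pos = torch.arange(num_steps)
# Shape of mask: (batch_size, number of queries, number of keys)
mask = key_pos[None, None, :] < dec_valid_lens[:, :, None]

scores = torch.randn(batch_size, num_steps, num_steps)
scores = scores.masked_fill(~mask, -1e6)  # large negative value before softmax
weights = torch.softmax(scores, dim=-1)
print(mask[0])
print(weights[0])
```

The mask is lower triangular, so each row of the attention weights places zero mass on positions after the query, which is exactly the autoregressive property required during training.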


To facilitate scaled dot product operations in the encoder-decoder attention and addition
operations in the residual connections, the feature dimension (num_hiddens) of the decoder
is the same as that of the encoder.


decoder_blk = TransformerDecoderBlock(24, 48, 8, 0.5, 0)
X = torch.ones((2, 100, 24))
state = [encoder_blk(X, valid_lens), valid_lens, [None]]
d2l.check_shape(decoder_blk(X, state)[0], X.shape)


Now we construct the entire Transformer decoder composed of num_blks instances of
TransformerDecoderBlock. In the end, a fully connected layer computes the prediction
for all the vocab_size possible output tokens. Both of the decoder self-attention weights
and the encoder-decoder attention weights are stored for later visualization.


class TransformerDecoder(d2l.AttentionDecoder):
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens, num_heads,
                 num_blks, dropout):
        super().__init__()
        self.num_hiddens = num_hiddens
        self.num_blks = num_blks
        self.embedding = nn.Embedding(vocab_size, num_hiddens)
        self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout)
        self.blks = nn.Sequential()
        for i in range(num_blks):
            self.blks.add_module("block"+str(i), TransformerDecoderBlock(
                num_hiddens, ffn_num_hiddens, num_heads, dropout, i))
        self.dense = nn.LazyLinear(vocab_size)

    def init_state(self, enc_outputs, enc_valid_lens):
        return [enc_outputs, enc_valid_lens, [None] * self.num_blks]

    def forward(self, X, state):
        X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens))
        self._attention_weights = [[None] * len(self.blks) for _ in range(2)]
        for i, blk in enumerate(self.blks):
            X, state = blk(X, state)
            # Decoder self-attention weights
            self._attention_weights[0][
                i] = blk.attention1.attention.attention_weights
            # Encoder-decoder attention weights
            self._attention_weights[1][
                i] = blk.attention2.attention.attention_weights
        return self.dense(X), state

    @property
    def attention_weights(self):
        return self._attention_weights
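
During prediction, each decoder block keeps the representations seen so far in state[2][i] and extends them by one time step per call. The toy loop below imitates only that bookkeeping; the variable names and the constant-valued inputs are made up for illustration.

```python
import torch

state_i = None  # plays the role of state[2][i], initialized to None
lengths = []
for t in range(3):
    # Representation of the single token fed in at time step t
    X_t = torch.full((2, 1, 4), float(t))
    if state_i is None:
        key_values = X_t
    else:
        # Append the new step to the cached steps along the time axis
        key_values = torch.cat((state_i, X_t), dim=1)
    state_i = key_values
    lengths.append(key_values.shape[1])
print(lengths)
```

The query always has length one, while the keys and values grow by one position per step, so no explicit mask is needed at prediction time.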


11.7.6 Training 


Let’s instantiate an encoder-decoder model by following the Transformer architecture.
Here we specify that both the Transformer encoder and the Transformer decoder have two
layers using 4-head attention. As in Section 10.7.6, we train the Transformer model for
sequence-to-sequence learning on the English-French machine translation dataset.


data = d2l.MTFraEng(batch_size=128)
num_hiddens, num_blks, dropout = 256, 2, 0.2
ffn_num_hiddens, num_heads = 64, 4
encoder = TransformerEncoder(
    len(data.src_vocab), num_hiddens, ffn_num_hiddens, num_heads,
    num_blks, dropout)
decoder = TransformerDecoder(
    len(data.tgt_vocab), num_hiddens, ffn_num_hiddens, num_heads,
    num_blks, dropout)
model = d2l.Seq2Seq(encoder, decoder, tgt_pad=data.tgt_vocab['<pad>'],
                    lr=0.001)
trainer = d2l.Trainer(max_epochs=30, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)


[Figure: train_loss and val_loss plotted against epoch.]


After training, we use the Transformer model to translate a few English sentences into 
French and compute their BLEU scores. 


engs = ['go .', 'i lost .', 'he\'s calm .', 'i\'m home .']
fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .']
preds, _ = model.predict_step(
    data.build(engs, fras), d2l.try_gpu(), data.num_steps)
for en, fr, p in zip(engs, fras, preds):
    translation = []
    for token in data.tgt_vocab.to_tokens(p):
        if token == '<eos>':
            break
        translation.append(token)
    print(f'{en} => {translation}, bleu,'
          f'{d2l.bleu(" ".join(translation), fr, k=2):.3f}')


go . => ['va', '!'], bleu,1.000
i lost . => ['je', 'perdu', '.'], bleu,0.687
he's calm . => ['il', 'est', 'mouillé', '.'], bleu,0.658
i'm home . => ['je', 'suis', 'chez', 'moi', '.'], bleu,1.000


Let’s visualize the Transformer attention weights when translating the final English sen- 
tence into French. The shape of the encoder self-attention weights is (number of encoder 
layers, number of attention heads, num_steps or number of queries, num_steps or number 
of key-value pairs). 


_, dec_attention_weights = model.predict_step(
    data.build([engs[-1]], [fras[-1]]), d2l.try_gpu(), data.num_steps, True)
enc_attention_weights = torch.cat(model.encoder.attention_weights, 0)
shape = (num_blks, num_heads, -1, data.num_steps)
enc_attention_weights = enc_attention_weights.reshape(shape)
d2l.check_shape(enc_attention_weights,
                (num_blks, num_heads, data.num_steps, data.num_steps))


In the encoder self-attention, both queries and keys come from the same input sequence. 
Since padding tokens do not carry meaning, with specified valid length of the input se- 
quence no query attends to positions of padding tokens. In the following, two layers of 
multi-head attention weights are presented row by row. Each head independently attends 
based on a separate representation subspace of queries, keys, and values. 


d2l.show_heatmaps(
    enc_attention_weights.cpu(), xlabel='Key positions',
    ylabel='Query positions', titles=['Head %d' % i for i in range(1, 5)],
    figsize=(7, 3.5))


To visualize the decoder self-attention weights and the encoder-decoder attention weights,
we need more data manipulations. For example, we fill the masked attention weights
with zero. Note that the decoder self-attention weights and the encoder-decoder attention
weights both have the same queries: the beginning-of-sequence token followed by the
output tokens and possibly end-of-sequence tokens.


dec_attention_weights_2d = [head[0].tolist()
                            for step in dec_attention_weights
                            for attn in step for blk in attn for head in blk]
dec_attention_weights_filled = torch.tensor(
    pd.DataFrame(dec_attention_weights_2d).fillna(0.0).values)
shape = (-1, 2, num_blks, num_heads, data.num_steps)
dec_attention_weights = dec_attention_weights_filled.reshape(shape)
dec_self_attention_weights, dec_inter_attention_weights = \
    dec_attention_weights.permute(1, 2, 3, 0, 4)

d2l.check_shape(dec_self_attention_weights,
                (num_blks, num_heads, data.num_steps, data.num_steps))
d2l.check_shape(dec_inter_attention_weights,
                (num_blks, num_heads, data.num_steps, data.num_steps))


Because of the autoregressive property of the decoder self-attention, no query attends to
key-value pairs after the query position.


d2l.show_heatmaps(
    dec_self_attention_weights[:, :, :, :],
    xlabel='Key positions', ylabel='Query positions',
    titles=['Head %d' % i for i in range(1, 5)], figsize=(7, 3.5))


Similar to the case in the encoder self-attention, via the specified valid length of the input
sequence, no query from the output sequence attends to those padding tokens from the input
sequence.


d2l.show_heatmaps(
    dec_inter_attention_weights, xlabel='Key positions',
    ylabel='Query positions', titles=['Head %d' % i for i in range(1, 5)],
    figsize=(7, 3.5))


Although the Transformer architecture was originally proposed for sequence-to-sequence
learning, as we will discover later in the book, either the Transformer encoder or the
Transformer decoder is often individually used for different deep learning tasks.


11.7.7 Summary 




The Transformer is an instance of the encoder-decoder architecture, though either the
encoder or the decoder can be used individually in practice. In the Transformer architecture,
multi-head self-attention is used for representing the input sequence and the output
sequence, though the decoder has to preserve the autoregressive property via a masked 
version. Both the residual connections and the layer normalization in the Transformer are 
important for training a very deep model. The positionwise feed-forward network in the 
Transformer model transforms the representation at all the sequence positions using the 
same MLP. 


11.7.8 Exercises 


1. Train a deeper Transformer in the experiments. How does it affect the training speed 
and the translation performance? 


2. Is it a good idea to replace scaled dot product attention with additive attention in the 
Transformer? Why? 


3. For language modeling, should we use the Transformer encoder, decoder, or both? How 
would you design this method? 


4. What challenges can Transformers face if input sequences are very long? Why? 


5. How would you improve the computational and memory efficiency of Transformers?
Hint: you may refer to the survey paper by Tay et al. (2020).
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11.8 Transformers for Vision 


The Transformer architecture was initially proposed for sequence-to-sequence learning, 
with a focus on machine translation. Subsequently, Transformers emerged as the model 
of choice in various natural language processing tasks (Brown et al., 2020, Devlin et al., 
2018, Radford et al., 2018, Radford et al., 2019, Raffel et al., 2020). However, in the field of 
computer vision the dominant architecture has remained the CNN (Chapter 8). Naturally, 
researchers started to wonder if it might be possible to do better by adapting Transformer 
models to image data. This question sparked immense interest in the computer vision com- 
munity. Recently, Ramachandran et al. (2019) proposed a scheme for replacing convolu- 
tion with self-attention. However, its use of specialized patterns in attention makes it hard 
to scale up models on hardware accelerators. Then, Cordonnier et al. (2020) theoretically 
proved that self-attention can learn to behave similarly to convolution. Empirically, 2 x 2 
patches were taken from images as inputs, but the small patch size makes the model only 
applicable to image data with low resolutions. 


Without specific constraints on patch size, vision Transformers (ViTs) extract patches from
images and feed them into a Transformer encoder to obtain a global representation, which
will finally be transformed for classification (Dosovitskiy et al., 2021). Notably,
Transformers show better scalability than CNNs: when training larger models on larger datasets,
vision Transformers outperform ResNets by a significant margin. Similar to the landscape
of network architecture design in natural language processing, Transformers have also
become a game-changer in computer vision.


import torch
from torch import nn
from d2l import torch as d2l


11.8.1 Model 


Fig. 11.8.1 depicts the model architecture of vision Transformers. This architecture consists 
of a stem that patchifies images, a body based on the multilayer Transformer encoder, and 
a head that transforms the global representation into the output label. 




The vision Transformer architecture. In this example, an image is split into nine patches. 
A special “<cls>” token and the nine flattened image patches are transformed via patch 
embedding and n Transformer encoder blocks into ten representations, respectively. The 
“<cls>” representation is further transformed into the output label. 


Consider an input image with height h, width w, and c channels. Specifying the patch
height and width both as p, the image is split into a sequence of $m = hw/p^2$ patches,
where each patch is flattened to a vector of length $cp^2$. In this way, image patches can be
treated similarly to tokens in text sequences by Transformer encoders. A special “<cls>”
(class) token and the m flattened image patches are linearly projected into a sequence of
m + 1 vectors, summed with learnable positional embeddings. The multilayer Transformer
encoder transforms m + 1 input vectors into the same number of output vector
representations of the same length. It works exactly the same way as the original Transformer encoder
in Fig. 11.7.1, only differing in the position of normalization. Since the “<cls>” token
attends to all the image patches via self-attention (see Fig. 11.6.1), its representation from
the Transformer encoder output will be further transformed into the output label.


11.8.2 Patch Embedding 


To implement a vision Transformer, let’s start with patch embedding in Fig. 11.8.1. Split- 
ting an image into patches and linearly projecting these flattened patches can be simplified 
as a single convolution operation, where both the kernel size and the stride size are set to 
the patch size. 


class PatchEmbedding(nn.Module):
    def __init__(self, img_size=96, patch_size=16, num_hiddens=512):
        super().__init__()
        def _make_tuple(x):
            if not isinstance(x, (list, tuple)):
                return (x, x)
            return x
        img_size, patch_size = _make_tuple(img_size), _make_tuple(patch_size)
        self.num_patches = (img_size[0] // patch_size[0]) * (
            img_size[1] // patch_size[1])
        self.conv = nn.LazyConv2d(num_hiddens, kernel_size=patch_size,
                                  stride=patch_size)

    def forward(self, X):
        # Output shape: (batch size, no. of patches, no. of channels)
        return self.conv(X).flatten(2).transpose(1, 2)


In the following example, taking images with height and width of img_size as inputs, the 
patch embedding outputs (img_size//patch_size)**2 patches that are linearly projected 
to vectors of length num_hiddens. 


img_size, patch_size, num_hiddens, batch_size = 96, 16, 512, 4
patch_emb = PatchEmbedding(img_size, patch_size, num_hiddens)
X = torch.zeros(batch_size, 3, img_size, img_size)
d2l.check_shape(patch_emb(X),
                (batch_size, (img_size//patch_size)**2, num_hiddens))
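
The claim that patchifying followed by a linear projection collapses into one strided convolution can be verified numerically. The sketch below uses a small nn.Conv2d with an explicit number of input channels (instead of the lazy variant) and extracts the patches with torch.nn.functional.unfold; the tiny image and patch sizes are assumptions chosen to keep the example fast.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)
img_size, patch_size, num_hiddens, c = 8, 4, 16, 3
X = torch.randn(2, c, img_size, img_size)

# Route 1: convolution whose kernel size and stride both equal the patch size
conv = nn.Conv2d(c, num_hiddens, kernel_size=patch_size, stride=patch_size)
out_conv = conv(X).flatten(2).transpose(1, 2)  # (batch, m, num_hiddens)

# Route 2: cut out non-overlapping patches and flatten each to length c*p*p
patches = F.unfold(X, kernel_size=patch_size, stride=patch_size)
patches = patches.transpose(1, 2)              # (batch, m, c*p*p)
# Reuse the convolution kernel as the weight of a linear projection
W = conv.weight.reshape(num_hiddens, -1)       # (num_hiddens, c*p*p)
out_linear = patches @ W.T + conv.bias

print(patches.shape)
print(torch.allclose(out_conv, out_linear, atol=1e-5))
```

Here m = (8/4)^2 = 4 patches of length 3 * 4 * 4 = 48 are produced, and both routes give the same embeddings up to floating-point error.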


11.8.3 Vision Transformer Encoder 


The MLP of the vision Transformer encoder is slightly different from the positionwise FFN 
of the original Transformer encoder (see Section 11.7.2). First, here the activation function 
uses the Gaussian error linear unit (GELU), which can be considered as a smoother version 
of the ReLU (Hendrycks and Gimpel, 2016). Second, dropout is applied to the output of 
each fully connected layer in the MLP for regularization.


class ViTMLP(nn.Module):
    def __init__(self, mlp_num_hiddens, mlp_num_outputs, dropout=0.5):
        super().__init__()
        self.dense1 = nn.LazyLinear(mlp_num_hiddens)
        self.gelu = nn.GELU()
        self.dropout1 = nn.Dropout(dropout)
        self.dense2 = nn.LazyLinear(mlp_num_outputs)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout2(self.dense2(self.dropout1(self.gelu(
            self.dense1(x)))))


The vision Transformer encoder block implementation just follows the pre-normalization 
design in Fig. 11.8.1, where normalization is applied right before multi-head attention or 
the MLP. In contrast to post-normalization (“add & norm” in Fig. 11.7.1), where normal- 
ization is placed right after residual connections, pre-normalization leads to more effective 
or efficient training for Transformers (Baevski and Auli, 2018, Wang et al., 2019, Xiong et 
al., 2020). 


class ViTBlock(nn.Module):
    def __init__(self, num_hiddens, norm_shape, mlp_num_hiddens,
                 num_heads, dropout, use_bias=False):
        super().__init__()
        self.ln1 = nn.LayerNorm(norm_shape)
        self.attention = d2l.MultiHeadAttention(num_hiddens, num_heads,
                                                dropout, use_bias)
        self.ln2 = nn.LayerNorm(norm_shape)
        self.mlp = ViTMLP(mlp_num_hiddens, num_hiddens, dropout)

    def forward(self, X, valid_lens=None):
        X = X + self.attention(*([self.ln1(X)] * 3), valid_lens)
        return X + self.mlp(self.ln2(X))


Just as in Section 11.7.4, no vision Transformer encoder block changes its input shape. 


X = torch.ones((2, 100, 24))
encoder_blk = ViTBlock(24, 24, 48, 8, 0.5)
encoder_blk.eval()
d2l.check_shape(encoder_blk(X), X.shape)
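
The difference between post-normalization and pre-normalization comes down to where the layer normalization sits relative to the residual addition. The following sketch places the two orderings side by side for the attention sublayer only. It uses PyTorch's built-in nn.MultiheadAttention as a stand-in for d2l.MultiHeadAttention and omits dropout; the function names post_norm and pre_norm are hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)
num_hiddens, num_heads = 24, 8
attn = nn.MultiheadAttention(num_hiddens, num_heads, batch_first=True)
ln = nn.LayerNorm(num_hiddens)

def post_norm(X):
    # "Add & norm": normalize after the residual addition
    return ln(X + attn(X, X, X, need_weights=False)[0])

def pre_norm(X):
    # Normalize the sublayer input; the residual path stays untouched
    Z = ln(X)
    return X + attn(Z, Z, Z, need_weights=False)[0]

X = torch.randn(2, 100, num_hiddens)
Y_post, Y_pre = post_norm(X), pre_norm(X)
print(Y_post.shape, Y_pre.shape)
```

Both variants preserve the input shape. With post-normalization every output vector is renormalized, whereas with pre-normalization the input X reaches the output through an unnormalized identity path.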


11.8.4 Putting It All Together 


The forward pass of vision Transformers below is straightforward. First, input images are
fed into a PatchEmbedding instance, whose output is concatenated with the “<cls>” token
embedding. They are summed with learnable positional embeddings before dropout. Then
the output is fed into the Transformer encoder that stacks num_blks instances of the
ViTBlock class. Finally, the representation of the “<cls>” token is projected by the network
head.




class ViT(d2l.Classifier):
    """Vision Transformer."""
    def __init__(self, img_size, patch_size, num_hiddens, mlp_num_hiddens,
                 num_heads, num_blks, emb_dropout, blk_dropout, lr=0.1,
                 use_bias=False, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.patch_embedding = PatchEmbedding(
            img_size, patch_size, num_hiddens)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, num_hiddens))
        num_steps = self.patch_embedding.num_patches + 1  # Add the cls token
        # Positional embeddings are learnable
        self.pos_embedding = nn.Parameter(
            torch.randn(1, num_steps, num_hiddens))
        self.dropout = nn.Dropout(emb_dropout)
        self.blks = nn.Sequential()
        for i in range(num_blks):
            self.blks.add_module(f"{i}", ViTBlock(
                num_hiddens, num_hiddens, mlp_num_hiddens,
                num_heads, blk_dropout, use_bias))
        self.head = nn.Sequential(nn.LayerNorm(num_hiddens),
                                  nn.Linear(num_hiddens, num_classes))

    def forward(self, X):
        X = self.patch_embedding(X)
        X = torch.cat((self.cls_token.expand(X.shape[0], -1, -1), X), 1)
        X = self.dropout(X + self.pos_embedding)
        for blk in self.blks:
            X = blk(X)
        return self.head(X[:, 0])


11.8.5 Training 


Training a vision Transformer on the Fashion-MNIST dataset is just like how CNNs were 
trained in Chapter 8. 


img_size, patch_size = 96, 16
num_hiddens, mlp_num_hiddens, num_heads, num_blks = 512, 2048, 8, 2
emb_dropout, blk_dropout, lr = 0.1, 0.1, 0.1
model = ViT(img_size, patch_size, num_hiddens, mlp_num_hiddens, num_heads,
            num_blks, emb_dropout, blk_dropout, lr)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(img_size, img_size))
trainer.fit(model, data)


11.8.6 Summary and Discussion 


You may have noticed that for small datasets like Fashion-MNIST, our implemented vision 
Transformer does not outperform the ResNet in Section 8.6. Similar observations can be 
made even on the ImageNet dataset (1.2 million images). This is because Transformers lack 
those useful principles in convolution, such as translation invariance and locality (Section 
7.1). However, the picture changes when training larger models on larger datasets (e.g.,
300 million images), where vision Transformers outperform ResNets by a large margin
in image classification, demonstrating intrinsic superiority of Transformers in scalability
(Dosovitskiy et al., 2021). The introduction of vision Transformers has changed the
landscape of network design for modeling image data. They were soon shown to be effective on
the ImageNet dataset with data-efficient training strategies of DeiT (Touvron et al., 2021).
However, the quadratic complexity of self-attention (Section 11.6) makes the Transformer
architecture less suitable for higher-resolution images. Towards a general-purpose
backbone network in computer vision, Swin Transformers addressed the quadratic
computational complexity with respect to image size (Section 11.6.2) and reinstated
convolution-like priors, extending the applicability of Transformers to a range of computer vision tasks
beyond image classification with state-of-the-art results (Liu et al., 2021).
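
A short calculation illustrates why resolution is a problem for plain self-attention. The helper below (a hypothetical function written for this illustration) counts the tokens produced by a square image, including the “<cls>” token, and the number of entries in a single attention matrix.

```python
def attention_matrix_entries(img_size, patch_size):
    """Return (number of tokens, entries in one self-attention matrix)."""
    num_tokens = (img_size // patch_size) ** 2 + 1  # patches plus <cls>
    return num_tokens, num_tokens ** 2

results = {}
for img_size in (96, 192, 384):
    results[img_size] = attention_matrix_entries(img_size, patch_size=16)
    print(img_size, results[img_size])
```

With a patch size of 16, doubling the side length from 96 to 192 raises the token count from 37 to 145 and the attention matrix from 1369 to 21025 entries per head, roughly a 15-fold increase; doubling again to 384 gives 577 tokens and 332929 entries.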


11.8.7 Exercises 
1. How does the value of img_size affect training time? 


2. Instead of projecting the “<cls>” token representation to the output, how would you 
project the averaged patch representations? Implement this change and see how it affects 
the accuracy. 


3. Can you modify hyperparameters to improve the accuracy of the vision Transformer? 
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11.9 Large-Scale Pretraining with Transformers 


So far in our image classification and machine translation experiments, models have been
trained from scratch on datasets with input-output examples to perform specific tasks. For
example, a Transformer was trained with English-French pairs (Section 11.7) so that this
model can translate input English text into French. As a result, each model becomes a
specific expert that is sensitive to even a slight shift in data distribution (Section 4.7). For
better generalized models, or even more competent generalists that can perform multiple
tasks with or without adaptation, pretraining models on large data has become increasingly
common.

Given larger data for pretraining, the Transformer architecture performs better with an
increased model size and training compute, demonstrating superior scaling behavior.
Specifically, performance of Transformer-based language models scales as a power law with the
number of model parameters, training tokens, and training compute (Kaplan et al., 2020).
The scalability of Transformers is also evidenced by the significantly boosted performance
from larger vision Transformers trained on larger data (discussed in Section 11.8). More
recent success stories include Gato, a generalist model that can play Atari, caption
images, chat, and act as a robot (Reed et al., 2022). Gato is a single Transformer that scales
well when pretrained on diverse modalities, including text, images, joint torques, and
button presses. Notably, all such multimodal data is serialized into a flat sequence of tokens,
which can be processed akin to text tokens (Section 11.7) or image patches (Section 11.8)
by Transformers.


Prior to the compelling success of pretraining Transformers for multimodal data, Trans- 
formers were extensively pretrained with a wealth of text. Originally proposed for machine 
translation, the Transformer architecture in Fig. 11.7.1 consists of an encoder for
representing input sequences and a decoder for generating target sequences. Primarily, Transformers
can be used in three different modes: encoder-only, encoder-decoder, and decoder-only.
To conclude this chapter, we will review these three modes and explain the scalability in 
pretraining Transformers. 


11.9.1 Encoder-Only 


When only the Transformer encoder is used, a sequence of input tokens is converted into the 
same number of representations that can be further projected into output (e.g., classifica- 
tion). A Transformer encoder consists of self-attention layers, where all input tokens attend 
to each other. For example, vision Transformers depicted in Fig. 11.8.1 are encoder-only, 
converting a sequence of input image patches into the representation of a special “<cls>” 
token. Since this representation depends on all input tokens, it is further projected into 
classification labels. This design was inspired by an earlier encoder-only Transformer pre- 
trained on text: BERT (Bidirectional Encoder Representations from Transformers) (Devlin 
et al., 2018). 


Pretraining BERT 


BERT is pretrained on text sequences using masked language modeling: input text with 
randomly masked tokens is fed into a Transformer encoder to predict the masked tokens. 
As illustrated in Fig. 11.9.1, an original text sequence “I”, “love”, “this”, “red”, “car” is
prepended with the “<cls>” token, and the “<mask>” token randomly replaces “love”; 
then the cross-entropy loss between the masked token “love” and its prediction is to be 
minimized during pretraining. Note that there is no constraint in the attention pattern of 
Transformer encoders (right of Fig. 11.9.1) so all tokens can attend to each other. Thus, 
prediction of “love” depends on input tokens before and after it in the sequence. This is 
why BERT is a “bidirectional encoder”. Without need for manual labeling, large-scale text 
data from books and Wikipedia can be used for pretraining BERT. 
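
The masked language modeling objective can be sketched on the example sentence. The code below is a toy illustration under several assumptions: the seven-token vocabulary, the randomly initialized encoder built from nn.TransformerEncoder, and the linear prediction head mlm_head are all stand-ins; positional information is omitted, and the masked position is chosen by hand instead of at random.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab = ['<cls>', '<mask>', 'i', 'love', 'this', 'red', 'car']
tok2id = {tok: i for i, tok in enumerate(vocab)}

original = ['<cls>', 'i', 'love', 'this', 'red', 'car']
masked_pos = 2  # position of "love", fixed here for illustration
inputs = list(original)
inputs[masked_pos] = '<mask>'

input_ids = torch.tensor([[tok2id[t] for t in inputs]])
target_id = torch.tensor([tok2id[original[masked_pos]]])

num_hiddens = 16
embedding = nn.Embedding(len(vocab), num_hiddens)
layer = nn.TransformerEncoderLayer(num_hiddens, nhead=4, dim_feedforward=32,
                                   dropout=0.0, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
mlm_head = nn.Linear(num_hiddens, len(vocab))

# No attention mask is passed, so every token attends to every other token
H = encoder(embedding(input_ids))
logits = mlm_head(H[:, masked_pos])  # predict only at the masked position
loss = nn.functional.cross_entropy(logits, target_id)
print(inputs, loss.item())
```

The loss is computed only at the masked position, yet the representation used there depends on the tokens on both sides of the mask, which is what makes the encoder bidirectional.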
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Left: Pretraining BERT with masked language modeling. Prediction of the masked “love” 
token depends on all input tokens before and after “love”. Right: Attention pattern in the 
Transformer encoder. Each token along the vertical axis attends to all input tokens along 
the horizontal axis. 
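The masking step described above can be sketched in plain Python. This is only an illustration of how the input and the prediction targets are formed, not BERT's actual preprocessing: the helper name `mask_tokens` and its explicit `positions` argument are assumptions made here (in real pretraining the positions are sampled at random), and no model is involved.

```python
def mask_tokens(tokens, positions, mask_token='<mask>', cls_token='<cls>'):
    """Build one masked-language-modeling example (hypothetical helper).

    Returns the encoder input and a dict that maps each masked input
    position to the original token that has to be predicted there.
    """
    inputs = [cls_token] + list(tokens)    # prepend the "<cls>" token
    labels = {}
    for pos in positions:                  # positions index into `tokens`
        labels[pos + 1] = inputs[pos + 1]  # +1 accounts for "<cls>"
        inputs[pos + 1] = mask_token
    return inputs, labels


tokens = ['I', 'love', 'this', 'red', 'car']
inputs, labels = mask_tokens(tokens, positions=[1])
print(inputs)  # ['<cls>', 'I', '<mask>', 'this', 'red', 'car']
print(labels)  # {2: 'love'}
```

The cross-entropy loss would then be computed only at the positions stored in `labels`, while the encoder still sees every token of `inputs`.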


Fine-Tuning BERT 


The pretrained BERT can be fine-tuned to downstream encoding tasks involving single text 
or text pairs. During fine-tuning, additional layers can be added to BERT with randomized 
parameters: these parameters and those pretrained BERT parameters will be updated to fit 
training data of downstream tasks. 


Fine-tuning BERT for sentiment analysis. (Figure: the input “<cls>”, “This”, “show”, “is”, “not”, “bad” is fed into the Transformer encoder, and the predicted label is “Positive”.) 


Fig. 11.9.2 illustrates fine-tuning of BERT for sentiment analysis. The Transformer encoder 
is a pretrained BERT, which takes a text sequence as input and feeds the “<cls>” represen- 
tation (global representation of the input) into an additional fully connected layer to predict 
the sentiment. During fine-tuning, the cross-entropy loss between the prediction and the 
label on sentiment analysis data is minimized via gradient-based algorithms, where the 
additional layer is trained from scratch while pretrained parameters of BERT are updated. 
BERT does more than sentiment analysis. The general language representations learned 
by the 350-million-parameter BERT from 250 billion training tokens advanced the state of 
the art for natural language tasks such as single text classification, text pair classification or 
regression, text tagging, and question answering. 


You may note that these downstream tasks include text pair understanding. BERT pretrain- 
ing has another loss for predicting whether one sentence immediately follows the other. 
However, this loss was later found to be less useful when pretraining RoBERTa, a BERT 
variant of the same size, on 2000 billion tokens (Liu et al., 2019). Other derivatives of 
BERT improved model architectures or pretraining objectives, such as ALBERT (enforcing 
parameter sharing) (Lan et al., 2019), SpanBERT (representing and predicting spans of 
text) (Joshi et al., 2020), DistilBERT (lightweight via knowledge distillation) (Sanh et al., 
2019), and ELECTRA (replaced token detection) (Clark et al., 2020). Moreover, BERT inspired 
Transformer pretraining in computer vision, such as with vision Transformers (Dosovitskiy 
et al., 2021), Swin Transformers (Liu et al., 2021), and MAE (masked autoencoders) 
(He et al., 2022). 


11.9.2 Encoder-Decoder 


Since a Transformer encoder converts a sequence of input tokens into the same number 
of output representations, the encoder-only mode cannot generate a sequence of arbitrary 
length as in machine translation. As originally proposed for machine translation, the 
Transformer architecture can be outfitted with a decoder that autoregressively predicts a 
target sequence of arbitrary length, token by token, conditional on both the encoder output 
and the decoder output: (i) for conditioning on the encoder output, encoder-decoder 
cross-attention (multi-head attention of the decoder in Fig. 11.7.1) allows target tokens to 
attend to all input tokens; (ii) conditioning on the decoder output is achieved by a so-called 
causal attention pattern (masked multi-head attention of the decoder in Fig. 11.7.1), where 
any target token can only attend to past and present tokens in the target sequence. (The name 
“causal” is common in the literature but misleading, as it has little connection to the proper 
study of causality.) 
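The attention patterns just described can be written down as Boolean masks. The following is a minimal plain-Python sketch, where entry `[i][j]` says whether query token `i` may attend to key token `j`; the helper names `full_mask` and `causal_mask` and the sequence lengths are illustrative assumptions, not part of any library.

```python
def full_mask(num_queries, num_keys):
    """Every query may attend to every key.

    Used for encoder self-attention and encoder-decoder cross-attention.
    """
    return [[True] * num_keys for _ in range(num_queries)]


def causal_mask(num_tokens):
    """Query i may attend to keys 0..i only (past and present tokens).

    Used for decoder self-attention.
    """
    return [[j <= i for j in range(num_tokens)] for i in range(num_tokens)]


num_src, num_tgt = 4, 3                      # illustrative lengths
encoder_self = full_mask(num_src, num_src)   # 4 x 4, all allowed
cross = full_mask(num_tgt, num_src)          # 3 x 4, all allowed
decoder_self = causal_mask(num_tgt)          # 3 x 3, lower triangular

for row in decoder_self:
    print(' '.join('x' if allowed else '.' for allowed in row))
```

In practice such masks are applied by setting the disallowed attention scores to a large negative value before the softmax, so that the corresponding attention weights become zero.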


To pretrain encoder-decoder Transformers beyond human-labeled machine translation data, 
BART (Lewis et al., 2019) and T5 (Raffel et al., 2020) were concurrently proposed as 
encoder-decoder Transformers pretrained on large-scale text corpora. Both attempt to 
reconstruct original text in their pretraining objectives, while the former emphasizes noising 
input (e.g., masking, deletion, permutation, and rotation) and the latter highlights multitask 
unification with comprehensive ablation studies. 


Pretraining T5 


As an example of the pretrained Transformer encoder-decoder, T5 (Text-to-Text Transfer 
Transformer) unifies many tasks as the same text-to-text problem: for any task, the input 
of the encoder is a task description (e.g., “Summarize”, “:”) followed by task input (e.g., 
a sequence of tokens from an article), and the decoder predicts the task output (e.g., a 
sequence of tokens summarizing the input article). To perform as text-to-text, T5 is trained 
to generate some target text conditional on input text. 


To obtain input and output from any original text, T5 is pretrained to predict consecutive 
spans. Specifically, tokens from text are randomly replaced by special tokens where 
each consecutive span is replaced by the same special token. Consider the example in Fig. 
11.9.3, where the original text is “I”, “love”, “this”, “red”, “car”. Tokens “love”, “red”, 
“car” are randomly replaced by special tokens. Since “red” and “car” are a consecutive 
span, they are replaced by the same special token. As a result, the input sequence is “I”, 
“<X>”, “this”, “<Y>”, and the target sequence is “<X>”, “love”, “<Y>”, “red”, “car”, 
“<Z>”, where “<Z>” is another special token marking the end. As shown in Fig. 11.9.3, 
the decoder has a causal attention pattern to prevent itself from attending to future tokens 
during sequence prediction. 


Left: Pretraining T5 by predicting consecutive spans. The original sentence is “I”, “love”, 
“this”, “red”, “car”, where “love” is replaced by a special “<X>” token, and consecutive 
“red”, “car” are replaced by a special “<Y>” token. The target sequence ends with a 
special “<Z>” token. Right: Attention pattern in the Transformer encoder-decoder. In the 
encoder self-attention (lower square), all input tokens attend to each other; in the 
encoder-decoder cross-attention (upper rectangle), each target token attends to all input 
tokens; in the decoder self-attention (upper triangle), each target token attends to present 
and past target tokens only (causal). 
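The construction of the input and target sequences from consecutive spans can be sketched as follows. This is a simplified illustration rather than T5's actual preprocessing: the function name `corrupt_spans`, the explicit set of masked positions (sampled at random in real pretraining), and the three sentinel names are assumptions chosen to mirror the running example.

```python
def corrupt_spans(tokens, masked, sentinels=('<X>', '<Y>', '<Z>')):
    """Replace each run of consecutive masked positions by one sentinel.

    `masked` is a set of positions to hide. Returns (input, target) lists.
    The sketch assumes there are more sentinels than masked spans.
    """
    inputs, targets = [], []
    next_sentinel = 0
    prev_masked = False
    for i, tok in enumerate(tokens):
        if i in masked:
            if not prev_masked:            # a new span starts here
                sentinel = sentinels[next_sentinel]
                next_sentinel += 1
                inputs.append(sentinel)    # the span collapses to one token
                targets.append(sentinel)   # the same token announces the span
            targets.append(tok)            # hidden token goes to the target
            prev_masked = True
        else:
            inputs.append(tok)
            prev_masked = False
    targets.append(sentinels[next_sentinel])  # final sentinel marks the end
    return inputs, targets


tokens = ['I', 'love', 'this', 'red', 'car']
inputs, targets = corrupt_spans(tokens, masked={1, 3, 4})
print(inputs)   # ['I', '<X>', 'this', '<Y>']
print(targets)  # ['<X>', 'love', '<Y>', 'red', 'car', '<Z>']
```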


In T5, predicting consecutive spans is also referred to as reconstructing corrupted text. 
With this objective, T5 is pretrained with 1000 billion tokens from the C4 (Colossal Clean 
Crawled Corpus) data, which consists of clean English text from the web (Raffel et al., 
2020). 


Fine-Tuning T5 


Similar to BERT, T5 needs to be fine-tuned (updating T5 parameters) on task-specific train- 
ing data to perform this task. Major differences from BERT fine-tuning include: (i) T5 
input includes task descriptions; (ii) T5 can generate sequences with arbitrary length with 
its Transformer decoder; (iii) No additional layers are required. 


Fig. 11.9.4 explains fine-tuning T5 using text summarization as an example. In this down- 
stream task, the task description tokens “Summarize”, “:” followed by the article tokens 
are input to the encoder. 


After fine-tuning, the 11-billion-parameter T5 (T5-11B) achieved state-of-the-art results on 
multiple encoding (e.g., classification) and generation (e.g., summarization) benchmarks. 
Since its release, T5 has been extensively used in later research. For example, Switch 
Transformers are designed based on T5 to activate a subset of the parameters for better 
computational efficiency (Fedus et al., 2022). In a text-to-image model called Imagen, text 
is input to a frozen T5 encoder (T5-XXL) with 4.6 billion parameters (Saharia et al., 2022). 
The photorealistic text-to-image examples in Fig. 11.9.5 suggest that the T5 encoder alone 
may effectively represent text even without fine-tuning. 


Fine-tuning T5 for text summarization. Both the task description and article tokens are 
fed into the Transformer encoder for predicting the summary. (Figure: the task description 
“Summarize”, “:” and the article “Oil prices slipped ... diesel taxes” form the input; the 
summary is “Oil prices extend slides”.) 


Text-to-image examples by the Imagen model, whose text encoder is from T5 (figures 
taken from Saharia et al. (2022)). Text prompts: “Teddy bears swimming at the Olympics 
400m Butterfly event.”; “A cute corgi lives in a house made out of sushi.”; “A cute sloth 
holding a small treasure chest. A bright golden glow is coming from the chest.” 


11.9.3 Decoder-Only 


We have reviewed encoder-only and encoder-decoder Transformers. Alternatively, 
decoder-only Transformers remove the entire encoder and the decoder sublayer with the 
encoder-decoder cross-attention from the original encoder-decoder architecture depicted in 
Fig. 11.7.1. Nowadays, decoder-only Transformers have become the de facto architecture in 
large-scale language modeling (Section 9.3), which leverages the world’s abundant unlabeled 
text corpora via self-supervised learning. 


GPT and GPT-2 


Using language modeling as the training objective, the GPT (generative pre-training) model 
chooses a Transformer decoder as its backbone (Radford et al., 2018). 


Following the autoregressive language model training as described in Section 9.3.3, Fig. 
11.9.6 illustrates GPT pretraining with a Transformer decoder, where the target sequence is 
the input sequence shifted by one token. Note that the attention pattern in the Transformer 
decoder enforces that each token can only attend to its past tokens (future tokens cannot be 
attended to because they have not yet been chosen). 


Left: Pretraining GPT with language modeling. The target sequence is the input sequence 
shifted by one token. Both “<bos>” and “<eos>” are special tokens marking the 
beginning and end of sequences, respectively. Right: Attention pattern in the Transformer 
decoder. Each token along the vertical axis attends to only its past tokens along the 
horizontal axis (causal). 
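The “shifted by one token” relationship between input and target can be made concrete with a few lines of plain Python. The helper name `lm_example` is hypothetical and the sentence is only an illustration; the point is that position t of the target is the token that follows position t of the input.

```python
def lm_example(tokens, bos='<bos>', eos='<eos>'):
    """Return (input, target) for autoregressive language modeling.

    The target is the input shifted left by one token.
    """
    seq = [bos] + list(tokens) + [eos]
    return seq[:-1], seq[1:]


inputs, targets = lm_example(['I', 'like', 'this', 'book'])
for t in range(len(inputs)):
    # With causal attention, the prediction at step t may only
    # depend on inputs[0..t].
    print(inputs[:t + 1], '->', targets[t])
```

Because of the causal attention pattern, all of these next-token predictions can be trained in parallel from a single pass over the sequence.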


GPT has 100 million parameters and needs to be fine-tuned for individual downstream 
tasks. A much larger Transformer-decoder language model, GPT-2, was introduced one 
year later (Radford et al., 2019). Compared with the original Transformer decoder in GPT, 
pre-normalization (discussed in Section 11.8.3) and improved initialization and weight- 
scaling were adopted in GPT-2. Pretrained on 40 GB of text, the 1.5-billion-parameter GPT- 
2 obtained the state-of-the-art results on language modeling benchmarks and promising 
results on multiple other tasks without updating the parameters or architecture. 


GPT-3 and Beyond 


GPT-2 demonstrated potential of using the same language model for multiple tasks without 
updating the model. This is more computationally efficient than fine-tuning, which requires 
model updates via gradient computation. 


Before explaining the more computationally efficient use of language models without 
parameter update, recall from Section 9.5 that a language model can be trained to generate 
a text sequence conditional on some prefix text sequence. Thus, a pretrained language model 
may generate the task output as a sequence without parameter update, conditional on an input 
sequence with the task description, task-specific input-output examples, and a prompt (task 
input). This learning paradigm is called in-context learning (Brown et al., 2020), which 
can be further categorized into zero-shot, one-shot, and few-shot, when there are no, one, 
or a few task-specific input-output examples, respectively (Fig. 11.9.7). 
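Since in-context learning only changes the input sequence, the three settings differ in nothing but how the prompt text is assembled. The sketch below is a plain-Python illustration; the function name `build_prompt`, the `->` arrow, and the ` | ` separator are formatting assumptions made here, and no language model is called.

```python
def build_prompt(task_description, examples, query, sep=' | '):
    """Assemble task description, input-output examples, and the prompt.

    `examples` is a list of (input, output) pairs; an empty list gives
    the zero-shot setting, one pair one-shot, several pairs few-shot.
    """
    parts = [f'{src} -> {tgt}' for src, tgt in examples]
    parts.append(f'{query} ->')  # the model is expected to continue here
    return f'{task_description} ' + sep.join(parts)


task = 'Translate English to French:'
zero_shot = build_prompt(task, [], "i'm home")
one_shot = build_prompt(task, [('go', 'va')], "i'm home")
few_shot = build_prompt(
    task, [('go', 'va'), ('i lost', "j'ai perdu")], "i'm home")

print(zero_shot)
print(one_shot)
print(few_shot)
```

In all three cases the pretrained model would simply continue the text after the final arrow, with no gradient computation.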


These three settings were tested in GPT-3 (Brown et al., 2020), whose largest version uses 
data and model size about two orders of magnitude larger than those in GPT-2. GPT-3 
uses the same Transformer decoder architecture as its direct predecessor GPT-2 except that 
attention patterns (at the right in Fig. 11.9.6) are sparser at alternating layers. Pretrained 
with 300 billion tokens, GPT-3 performs better with larger model size, where few-shot 
performance increases most rapidly (Fig. 11.9.8). 


Zero-shot, one-shot, few-shot in-context learning with language models (Transformer 
decoders). No parameter update is needed. (Figure panels: Zero-shot, One-shot, Few-shot. 
In each panel the input to the Transformer decoder consists of a task description such as 
“Translate English to French:”, zero, one, or a few examples such as “go -> va”, and a 
prompt such as “i'm home ->”.) 


Aggregate performance of GPT-3 for all 42 accuracy-denominated benchmarks (caption 
adapted and figure taken from Brown et al. (2020)). 


The technical details of the subsequent GPT-4 model were not fully disclosed in its report 
(OpenAI, 2023). In contrast to its predecessors, GPT-4 is a large-scale, multimodal model 
that can take both text and images as input and generate text output. 


11.9.4 Scalability 


Fig. 11.9.8 empirically demonstrates scalability of Transformers in the GPT-3 language 
model. For language modeling, more comprehensive empirical studies on the scalability 
of Transformers have led researchers to see promise in training larger Transformers with 
more data and compute (Kaplan et al., 2020). 


Transformer language model performance improves smoothly as we increase the model 
size, dataset size, and amount of compute used for training. For optimal performance all 
three factors must be scaled up in tandem. Empirical performance has a power-law 
relationship with each individual factor when not bottlenecked by the other two (caption 
adapted and figure taken from Kaplan et al. (2020)). (Figure panels: performance versus 
compute (PF-days, non-embedding), dataset size (tokens), and parameters (non-embedding).) 


As shown in Fig. 11.9.9, power-law scaling can be observed in the performance with re- 
spect to the model size (number of parameters, excluding embedding layers), dataset size 
(number of training tokens), and amount of training compute (PetaFLOP/s-days, excluding 
embedding layers). In general, increasing all three factors in tandem leads to better 
performance. However, how to increase them in tandem remains a matter of debate 
(Hoffmann et al., 2022). 


Transformer language model training runs (figure taken from Kaplan et al. (2020)). Left 
panel: “Larger models require fewer samples to reach the same performance” (test loss 
versus tokens processed; line color indicates the number of parameters). Right panel: “The 
optimal model size grows smoothly with the loss target and compute budget” (test loss 
versus compute in PF-days; compute-efficient training stops far short of convergence). 


As well as increased performance, large models also enjoy better sample efficiency than 
small models. Fig. 11.9.10 shows that large models need fewer training samples (tokens 
processed) to perform at the same level achieved by small models, and performance 
scales smoothly with compute. 


The empirical scaling behaviors in Kaplan et al. (2020) have been tested in subsequent 
large Transformer models. For example, GPT-3 supported this hypothesis over two more 
orders of magnitude of compute (Fig. 11.9.11). 


GPT-3 performance (cross-entropy validation loss) follows a power-law trend with the 
amount of compute used for training. The power-law behavior observed in Kaplan et al. 
(2020) continues for an additional two orders of magnitude with only small deviations 
from the predicted curve. Embedding parameters are excluded from compute and 
parameter counts (caption adapted and figure taken from Brown et al. (2020)). (Figure: 
validation loss versus compute in PetaFLOP/s-days, with line color indicating the number 
of parameters; the fitted curve is annotated as $L = 2.57 \cdot C^{-0.048}$.) 
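To get a feel for what such a power law implies, the fitted curve annotated in the figure of Brown et al. (2020), $L = 2.57 \cdot C^{-0.048}$ with compute $C$ in PetaFLOP/s-days, can be evaluated directly. The snippet below is a small arithmetic sketch; the function name `predicted_loss` and the chosen compute values are illustrative assumptions, and the curve is only an empirical fit over the range studied in that paper.

```python
def predicted_loss(compute_pf_days, coeff=2.57, exponent=-0.048):
    """Evaluate the power-law fit L = coeff * C ** exponent."""
    return coeff * compute_pf_days ** exponent


# Multiplying compute by 10 always scales the loss by the same factor.
ratio_per_decade = 10 ** -0.048
print(f'loss ratio per 10x compute: {ratio_per_decade:.3f}')

for compute in (1e-2, 1e0, 1e2, 1e4):
    print(f'C = {compute:8.0e} PF-days -> L = {predicted_loss(compute):.3f}')
```

On a log-log plot this relationship is a straight line with slope -0.048, which is why each additional order of magnitude of compute buys the same relative (roughly 10%) reduction in loss.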


11.9.5 Large Language Models 


The scalability of Transformers in the GPT series has inspired subsequent large language 
models. The GPT-2 Transformer decoder was used for training the 530-billion-parameter 
Megatron-Turing NLG (Smith et al., 2022) with 270 billion training tokens. Following 
the GPT-2 design, the 280-billion-parameter Gopher (Rae et al., 2021), pretrained with 300 
billion tokens, performed competitively across diverse tasks. Inheriting the same 
architecture and using the same compute budget as Gopher, Chinchilla (Hoffmann et al., 2022) 
is a substantially smaller (70 billion parameters) model that trains for much longer (1.4 
trillion training tokens), outperforming Gopher on many tasks and placing more emphasis on 
the number of tokens than on the number of parameters. To continue the scaling line of 
language modeling, PaLM (Pathways Language Model) (Chowdhery et al., 2022), a 
540-billion-parameter Transformer decoder with modified designs pretrained on 780 billion 
tokens, outperformed average human performance on the BIG-Bench benchmark (Srivastava 
et al., 2022). Its later version, PaLM 2 (Anil et al., 2023), scaled data and model roughly 1:1 
and improved multilingual and reasoning capabilities. Other large language models, such 
as Minerva (Lewkowycz et al., 2022) that further trains a generalist (PaLM) and Galac- 
tica (Taylor et al., 2022) that is not trained on a general corpus, have shown promising 
quantitative and scientific reasoning capabilities. 
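The claim that Chinchilla and Gopher use a comparable compute budget can be checked with back-of-the-envelope arithmetic. The sketch below relies on an assumption that is not stated in this section: the common rule of thumb from the scaling-law literature that training a Transformer costs roughly 6 floating-point operations per parameter per token. The function name `approx_train_flops` is hypothetical and the result is an estimate only.

```python
def approx_train_flops(num_params, num_tokens):
    """Rough training cost, assuming about 6 FLOPs per parameter per token."""
    return 6 * num_params * num_tokens


models = {
    # name: (parameters, training tokens), as quoted in the text above
    'Gopher': (280e9, 300e9),
    'Chinchilla': (70e9, 1.4e12),
}

for name, (params, tokens) in models.items():
    flops = approx_train_flops(params, tokens)
    print(f'{name:10s} ~{flops:.2e} FLOPs, '
          f'{tokens / params:.1f} tokens per parameter')

ratio = (approx_train_flops(*models['Chinchilla'])
         / approx_train_flops(*models['Gopher']))
print(f'Chinchilla / Gopher compute ratio: {ratio:.2f}')
```

Under this approximation the two budgets are of the same order, while the number of training tokens per parameter differs by more than an order of magnitude, which is the trade-off the text describes.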


Open-sourced releases, such as OPT (Open Pretrained Transformers) (Zhang et al., 2022), 
BLOOM (Scao et al., 2022), and FALCON (Penedo et al., 2023), democratized research 
and use of large language models. Focusing on computational efficiency at inference time, 
the open-sourced Llama 1 (Touvron et al., 2023a) outperformed much larger models by 
training on more tokens than had been typically used. The updated Llama 2 (Touvron et 
al., 2023b) further increased the pretraining corpus by 40%, leading to product models that 
may match the performance of competitive closed-source models. 




Wei et al. (2022) discussed emergent abilities of large language models that are present in 
larger models, but not in smaller models. However, simply increasing model size does not 
inherently make models follow human instructions better. Sanh et al. (2021) and Wei et al. 
(2021) found that fine-tuning large language models on a range of datasets described 
via instructions can improve zero-shot performance on held-out tasks. Using reinforcement 
learning from human feedback, Ouyang et al. (2022) fine-tuned GPT-3 to follow a diverse 
set of instructions. Following the resultant InstructGPT, which aligns language models with 
human intent via fine-tuning (Ouyang et al., 2022), ChatGPT can generate human-like 
responses (e.g., code debugging and creative writing) based on conversations with humans 
and can perform many natural language processing tasks zero-shot (Qin et al., 2023). Bai et 
al. (2022) replaced human inputs (e.g., human-labeled data) with model outputs to partially 
automate the instruction tuning process, which is also known as reinforcement learning 
from AI feedback. 


Large language models offer an exciting prospect of formulating text input to induce models 
to perform desired tasks via in-context learning, which is also known as prompting. No- 
tably, chain-of-thought prompting (Wei et al., 2022), an in-context learning method with 
few-shot “question, intermediate reasoning steps, answer” demonstrations, elicits the com- 
plex reasoning capabilities of large language models in order to solve mathematical, com- 
monsense, and symbolic reasoning tasks. Sampling multiple reasoning paths (Wang et al., 
2023), diversifying few-shot demonstrations (Zhang et al., 2023), and reducing complex 
problems to sub-problems (Zhou et al., 2023) can all improve the reasoning accuracy. In 
fact, with simple prompts like “Let’s think step by step” just before each answer, large lan- 
guage models can even perform zero-shot chain-of-thought reasoning with decent accuracy 
(Kojima et al., 2022). Even for multimodal inputs consisting of both text and images, lan- 
guage models can perform multimodal chain-of-thought reasoning with higher accuracy 
than using text input only (Zhang et al., 2023). 


11.9.6 Summary and Discussion 


Transformers have been pretrained as encoder-only (e.g., BERT), encoder-decoder (e.g., 
T5), and decoder-only (e.g., GPT series). Pretrained models may be adapted to perform 
different tasks with model update (e.g., fine-tuning) or not (e.g., few-shot). Scalability of 
Transformers suggests that better performance benefits from larger models, more training 
data, and more training compute. Since Transformers were first designed and pretrained 
for text data, this section leans slightly towards natural language processing. Nonetheless, 
those models discussed above can often be found in more recent models across multiple 
modalities. For example, (i) Chinchilla (Hoffmann et al., 2022) was further extended to 
Flamingo (Alayrac et al., 2022), a visual language model for few-shot learning; (ii) GPT-2 
(Radford et al., 2019) and the vision Transformer encode text and images in CLIP (Con- 
trastive Language-Image Pre-training) (Radford et al., 2021), whose image and text em- 
beddings were later adopted in the DALL-E 2 text-to-image system (Ramesh et al., 2022). 
Although there have been no systematic studies on Transformer scalability in multimodal 
pretraining yet, an all-Transformer text-to-image model called Parti (Yu et al., 2022) shows 
potential of scalability across modalities: a larger Parti is more capable of high-fidelity 
image generation and content-rich text understanding (Fig. 11.9.12). 


Image examples generated from the same text by the Parti model of increasing sizes 
(350M, 750M, 3B, 20B) (examples taken from Yu et al. (2022)). Text prompt: “A portrait 
photo of a kangaroo wearing an orange hoodie and blue sunglasses standing on the grass 
in front of the Sydney Opera House holding a sign on the chest that says Welcome Friends!” 


11.9.7 Exercises 


1. Is it possible to fine-tune T5 using a minibatch consisting of different tasks? Why or 
why not? How about for GPT-2? 


2. Given a powerful language model, what applications can you think of? 


3. Say that you are asked to fine-tune a language model to perform text classification by 
adding additional layers. Where will you add them? Why? 


4. Consider sequence-to-sequence problems (e.g., machine translation) where the input 
sequence is always available throughout the target sequence prediction. What could be 
limitations of modeling with decoder-only Transformers? Why? 


If you read the book in sequence up to this point you already used a number of optimization 
algorithms to train deep learning models. They were the tools that allowed us to continue 
updating model parameters and to minimize the value of the loss function, as evaluated on 
the training set. Indeed, anyone content with treating optimization as a black box device 
to minimize objective functions in a simple setting might well content oneself with the 
knowledge that there exists an array of incantations of such a procedure (with names such 
as “SGD” and “Adam’’). 


To do well, however, some deeper knowledge is required. Optimization algorithms are 
important for deep learning. On the one hand, training a complex deep learning model can 
take hours, days, or even weeks. The performance of the optimization algorithm directly 
affects the model’s training efficiency. On the other hand, understanding the principles 
of different optimization algorithms and the role of their hyperparameters will enable us to 
tune the hyperparameters in a targeted manner to improve the performance of deep learning 
models. 


In this chapter, we explore common deep learning optimization algorithms in depth. Al- 
most all optimization problems arising in deep learning are nonconvex. Nonetheless, the 
design and analysis of algorithms in the context of convex problems have proven to be very 
instructive. It is for that reason that this chapter includes a primer on convex optimization 
and the proof for a very simple stochastic gradient descent algorithm on a convex objective 
function. 


12.1 Optimization and Deep Learning 


In this section, we will discuss the relationship between optimization and deep learning as 
well as the challenges of using optimization in deep learning. For a deep learning problem, 
we will usually define a loss function first. Once we have the loss function, we can use an 
optimization algorithm in an attempt to minimize the loss. In optimization, a loss function is 
often referred to as the objective function of the optimization problem. By tradition and con- 
vention most optimization algorithms are concerned with minimization. If we ever need to 
maximize an objective there is a simple solution: just flip the sign on the objective. 


12.1.1 Goal of Optimization 


Although optimization provides a way to minimize the loss function for deep learning, 
in essence, the goals of optimization and deep learning are fundamentally different. The 
former is primarily concerned with minimizing an objective whereas the latter is concerned 
with finding a suitable model, given a finite amount of data. In Section 3.6, we discussed the 
difference between these two goals in detail. For instance, training error and generalization 
error generally differ: since the objective function of the optimization algorithm is usually a 
loss function based on the training dataset, the goal of optimization is to reduce the training 
error. However, the goal of deep learning (or more broadly, statistical inference) is to reduce 
the generalization error. To accomplish the latter we need to pay attention to overfitting in 
addition to using the optimization algorithm to reduce the training error. 


%matplotlib inline
import numpy as np
import torch
from mpl_toolkits import mplot3d
from d2l import torch as d2l


To illustrate the aforementioned different goals, let’s consider the empirical risk and the 
risk. As described in Section 4.7.3, the empirical risk is an average loss on the training 
dataset while the risk is the expected loss on the entire population of data. Below we define 
two functions: the risk function f and the empirical risk function g. Suppose that we have 
only a finite amount of training data. As a result, here g is less smooth than f. 


def f(x):
    return x * torch.cos(np.pi * x)

def g(x):
    return f(x) + 0.2 * torch.cos(5 * np.pi * x)


The graph below illustrates that the minimum of the empirical risk on a training dataset 
may be at a different location from the minimum of the risk (generalization error). 


def annotate(text, xy, xytext):  #@save
    d2l.plt.gca().annotate(text, xy=xy, xytext=xytext,
                           arrowprops=dict(arrowstyle='->'))

x = torch.arange(0.5, 1.5, 0.01)
d2l.set_figsize((4.5, 2.5))
d2l.plot(x, [f(x), g(x)], 'x', 'risk')
annotate('min of\nempirical risk', (1.0, -1.2), (0.5, -1.1))
annotate('min of risk', (1.1, -1.05), (0.95, -0.5))
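The gap between the two minimizers can also be checked numerically rather than read off the plot. The following is a plain-Python sketch using only the standard library; the names `risk` and `empirical_risk` are stand-ins introduced here for the functions f and g above, and the grid search is just the simplest way to locate the minima on this interval.

```python
import math


def risk(x):
    """Stand-in for the risk function f."""
    return x * math.cos(math.pi * x)


def empirical_risk(x):
    """Stand-in for the empirical risk function g (risk plus a wiggle)."""
    return risk(x) + 0.2 * math.cos(5 * math.pi * x)


# Same grid as the plot: 0.5, 0.51, ..., 1.49
xs = [0.5 + 0.01 * i for i in range(100)]
x_risk = min(xs, key=risk)
x_emp = min(xs, key=empirical_risk)

print(f'minimizer of risk:           x = {x_risk:.2f}')
print(f'minimizer of empirical risk: x = {x_emp:.2f}')
print(f'risk at empirical minimizer: {risk(x_emp):.4f} '
      f'(best achievable: {risk(x_risk):.4f})')
```

Minimizing the empirical risk perfectly therefore still leaves a gap in the risk itself, which is exactly the distinction between the goal of optimization and the goal of learning.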


12.1.2 Optimization Challenges in Deep Learning 


In this chapter, we are going to focus specifically on the performance of optimization algo- 
rithms in minimizing the objective function, rather than a model’s generalization error. In 
Section 3.1 we distinguished between analytical solutions and numerical solutions in optimization 
problems. In deep learning, most objective functions are complicated and do not 
have analytical solutions. Instead, we must use numerical optimization algorithms. The 
optimization algorithms in this chapter all fall into this category. 


There are many challenges in deep learning optimization. Some of the most vexing ones 
are local minima, saddle points, and vanishing gradients. Let’s have a look at them. 


Local Minima 


For any objective function f(x), if the value of f(x) at x is smaller than the values of f(x) 
at any other points in the vicinity of x, then f(x) could be a local minimum. If the value 
of f(x) at x is the minimum of the objective function over the entire domain, then f(x) is 
the global minimum. 


For example, given the function 
$$f(x) = x \cdot \cos(\pi x) \text{ for } -1.0 \leq x \leq 2.0, \qquad (12.1.1)$$


we can approximate the local minimum and global minimum of this function. 


x = torch.arange(-1.0, 2.0, 0.01)
d2l.plot(x, [f(x)], 'x', 'f(x)')
annotate('local minimum', (-0.3, -0.25), (-0.77, -1.0))
annotate('global minimum', (1.1, -0.95), (0.6, 0.8))




The objective function of deep learning models usually has many local optima. When the 
numerical solution of an optimization problem is near the local optimum, the numerical 
solution obtained by the final iteration may only minimize the objective function locally, 
rather than globally, as the gradient of the objective function’s solutions approaches or 
becomes zero. Only some degree of noise might knock the parameter out of the local 
minimum. In fact, this is one of the beneficial properties of minibatch stochastic gradient 
descent where the natural variation of gradients over minibatches is able to dislodge the 
parameters from local minima. 
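The local and global minima of (12.1.1) can be approximated with a simple grid scan. This is a plain-Python sketch using only the standard library; the name `objective` is a stand-in for f, and comparing each grid point with its two neighbors is just one elementary way to detect interior local minima.

```python
import math


def objective(x):
    """The function from (12.1.1): f(x) = x * cos(pi * x)."""
    return x * math.cos(math.pi * x)


# Grid on [-1.0, 2.0] with step 0.01
xs = [-1.0 + 0.01 * i for i in range(301)]
ys = [objective(x) for x in xs]

# A grid point is an interior local minimum if it lies below both neighbors.
local_minima = [(xs[i], ys[i]) for i in range(1, len(xs) - 1)
                if ys[i] < ys[i - 1] and ys[i] < ys[i + 1]]
global_min = min(zip(xs, ys), key=lambda point: point[1])

for x_min, y_min in local_minima:
    print(f'local minimum near x = {x_min:.2f}, f(x) = {y_min:.3f}')
print(f'global minimum near x = {global_min[0]:.2f}, '
      f'f(x) = {global_min[1]:.3f}')
```

A gradient-based method started to the left of the hump between the two minima would settle in the shallower one, since the gradient vanishes there just as it does at the global minimum.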


Saddle Points 


Besides local minima, saddle points are another reason for gradients to vanish. A saddle 
point is any location where all gradients of a function vanish but which is neither a global 
nor a local minimum. Consider the function $f(x) = x^3$. Its first and second derivatives vanish 
at $x = 0$. Optimization might stall at this point, even though it is not a minimum. 


x = torch.arange(-2.0, 2.0, 0.01)
d2l.plot(x, [x**3], 'x', 'f(x)')
annotate('saddle point', (0, -0.2), (-0.52, -5.0))




Saddle points in higher dimensions are even more insidious, as the example below shows. 
Consider the function $f(x, y) = x^2 - y^2$. It has its saddle point at (0, 0). This is a maximum 
with respect to y and a minimum with respect to x. Moreover, it looks like a saddle, which 
is where this mathematical property got its name. 


x, y = torch.meshgrid(
    torch.linspace(-1.0, 1.0, 101), torch.linspace(-1.0, 1.0, 101))
z = x**2 - y**2

ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
ax.plot([0], [0], [0], 'rx')
ticks = [-1, 0, 1]
d2l.plt.xticks(ticks)
d2l.plt.yticks(ticks)
ax.set_zticks(ticks)
d2l.plt.xlabel('x')
d2l.plt.ylabel('y');


We assume that the input of a function is a k-dimensional vector and its output is a scalar, 
so its Hessian matrix will have k eigenvalues. The solution of the function could be a local 
minimum, a local maximum, or a saddle point at a position where the function gradient is 
zero: 


e When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are 
all positive, we have a local minimum for the function. 


e When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are 
all negative, we have a local maximum for the function. 


e When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are 
negative and positive, we have a saddle point for the function. 
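
The three cases above can be checked numerically. The following is a minimal sketch for the two-dimensional case only, assuming the Hessian at the zero-gradient position is already known. The helper names `eigenvalues_2x2` and `classify` are illustrative and not part of the book's code.

```python
import math

def eigenvalues_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]] in ascending order."""
    mean = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return mean - radius, mean + radius

def classify(a, b, c, tol=1e-12):
    """Classify a zero-gradient point from its 2x2 Hessian [[a, b], [b, c]]."""
    low, high = eigenvalues_2x2(a, b, c)
    if low > tol:
        return 'local minimum'
    if high < -tol:
        return 'local maximum'
    if low < -tol and high > tol:
        return 'saddle point'
    return 'inconclusive (some eigenvalue is zero)'

# f(x, y) = x^2 - y^2 has Hessian [[2, 0], [0, -2]] at the origin.
print(classify(2.0, 0.0, -2.0))
# f(x, y) = x^2 + y^2 has Hessian [[2, 0], [0, 2]] at the origin.
print(classify(2.0, 0.0, 2.0))
```

Note that the test says nothing when an eigenvalue is zero, which is why the sketch reports such cases as inconclusive.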


For high-dimensional problems the likelihood that at least some of the eigenvalues are neg- 
ative is quite high. This makes saddle points more likely than local minima. We will discuss 
some exceptions to this situation in the next section when introducing convexity. In short, 
convex functions are those where the eigenvalues of the Hessian are never negative. Sadly, 
though, most deep learning problems do not fall into this category. Nonetheless it is a great 
tool to study optimization algorithms. 


Vanishing Gradients 


Probably the most insidious problem to encounter is the vanishing gradient. Recall our 
commonly-used activation functions and their derivatives in Section 5.1.2. For instance, 
assume that we want to minimize the function f(x) = tanh(x) and we happen to get started 
at $x = 4$. As we can see, the gradient of $f$ is close to nil. More specifically, $f'(x) = 1 - \tanh^2(x)$
and thus $f'(4) = 0.0013$. Consequently, optimization will get stuck for a
long time before we make progress. This turns out to be one of the reasons that training 
deep learning models was quite tricky prior to the introduction of the ReLU activation 
function. 


x = torch.arange(-2.0, 5.0, 0.01) 
d2l.plot(x, [torch.tanh(x)], 'x', 'f(x)')
annotate('vanishing gradient', (4, 1), (2, 0.0))
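
To make the size of the problem concrete, the sketch below evaluates the derivative at $x = 4$ and then runs plain gradient descent from that point. It uses only the standard library; the learning rate of 0.1 and the 100 steps are arbitrary choices for illustration, not values from the text.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: f'(x) = 1 - tanh(x)^2."""
    return 1 - math.tanh(x) ** 2

print(f"f'(4) = {tanh_grad(4.0):.4f}")

# Plain gradient descent on f(x) = tanh(x), started in the flat region.
x, eta = 4.0, 0.1  # the learning rate 0.1 is an arbitrary choice
for _ in range(100):
    x -= eta * tanh_grad(x)
print(f'x after 100 steps: {x:.4f}')
```

Even after 100 updates the iterate has barely left its starting point, because every step is scaled by a gradient of roughly one thousandth.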


As we saw, optimization for deep learning is full of challenges. Fortunately there exists a
robust range of algorithms that perform well and that are easy to use even for beginners.
Furthermore, it is not really necessary to find the best solution. Local optima or even
approximate solutions thereof are still very useful.


12.1.3 Summary 


• Minimizing the training error does not guarantee that we find the best set of parameters
to minimize the generalization error.


• The optimization problems may have many local minima.


• The problem may have even more saddle points, as generally the problems are not convex.


• Vanishing gradients can cause optimization to stall. Often a reparametrization of the
problem helps. Good initialization of the parameters can be beneficial, too.


12.1.4 Exercises 


1. Consider a simple MLP with a single hidden layer of, say, d dimensions in the hid- 
den layer and a single output. Show that for any local minimum there are at least d! 
equivalent solutions that behave identically. 


2. Assume that we have a symmetric random matrix $\mathbf{M}$ where the entries $M_{ij} = M_{ji}$ are
each drawn from some probability distribution $p_{ij}$. Furthermore assume that $p_{ij}(x) = p_{ij}(-x)$,
i.e., that the distribution is symmetric (see e.g., Wigner (1958) for details).


   1. Prove that the distribution over eigenvalues is also symmetric. That is, for any
   eigenvector $\mathbf{v}$ the probability that the associated eigenvalue $\lambda$ satisfies
   $P(\lambda > 0) = P(\lambda < 0)$.


   2. Why does the above not imply $P(\lambda > 0) = 0.5$?
3. What other challenges involved in deep learning optimization can you think of? 
4. Assume that you want to balance a (real) ball on a (real) saddle.

   1. Why is this hard?

   2. Can you exploit this effect also for optimization algorithms?


12.2 Convexity


Convexity plays a vital role in the design of optimization algorithms. This is largely due 
to the fact that it is much easier to analyze and test algorithms in such a context. In other 
words, if the algorithm performs poorly even in the convex setting, typically we should not 
hope to see great results otherwise. Furthermore, even though the optimization problems in 
deep learning are generally nonconvex, they often exhibit some properties of convex ones 
near local minima. This can lead to exciting new optimization variants such as the one
proposed by Izmailov et al. (2018).


%matplotlib inline
import numpy as np
import torch
from mpl_toolkits import mplot3d
from d2l import torch as d2l


12.2.1 Definitions 


Before convex analysis, we need to define convex sets and convex functions. They lead to 
mathematical tools that are commonly applied to machine learning. 


Convex Sets 


Sets are the basis of convexity. Simply put, a set $\mathcal{X}$ in a vector space is convex if for any
$a, b \in \mathcal{X}$ the line segment connecting $a$ and $b$ is also in $\mathcal{X}$. In mathematical terms this
means that for all $\lambda \in [0, 1]$ we have


$$\lambda a + (1 - \lambda) b \in \mathcal{X} \text{ whenever } a, b \in \mathcal{X}. \tag{12.2.1}$$


This sounds a bit abstract. Consider Fig. 12.2.1. The first set is not convex since there exist 
line segments that are not contained in it. The other two sets suffer no such problem. 


Fig. 12.2.1: The first set is nonconvex and the other two are convex.


Definitions on their own are not particularly useful unless you can do something with them.
In this case we can look at intersections as shown in Fig. 12.2.2. Assume that $\mathcal{X}$ and $\mathcal{Y}$ are
convex sets. Then $\mathcal{X} \cap \mathcal{Y}$ is also convex. To see this, consider any $a, b \in \mathcal{X} \cap \mathcal{Y}$. Since $\mathcal{X}$
and $\mathcal{Y}$ are convex, the line segment connecting $a$ and $b$ is contained in both $\mathcal{X}$ and $\mathcal{Y}$.
Given that, it also needs to be contained in $\mathcal{X} \cap \mathcal{Y}$, thus proving our theorem.


Fig. 12.2.2: The intersection between two convex sets is convex.


We can strengthen this result with little effort: given convex sets $\mathcal{X}_i$, their intersection $\cap_i \mathcal{X}_i$
is convex. To see that the converse is not true, consider two disjoint sets $\mathcal{X} \cap \mathcal{Y} = \emptyset$. Now
pick $a \in \mathcal{X}$ and $b \in \mathcal{Y}$. The line segment in Fig. 12.2.3 connecting $a$ and $b$ needs to contain
some part that is neither in $\mathcal{X}$ nor in $\mathcal{Y}$, since we assumed that $\mathcal{X} \cap \mathcal{Y} = \emptyset$. Hence the line
segment is not in $\mathcal{X} \cup \mathcal{Y}$ either, thus proving that in general unions of convex sets need not
be convex.


Fig. 12.2.3: The union of two convex sets need not be convex.


Typically the problems in deep learning are defined on convex sets. For instance, $\mathbb{R}^d$, the
set of $d$-dimensional vectors of real numbers, is a convex set (after all, the line between any
two points in $\mathbb{R}^d$ remains in $\mathbb{R}^d$). In some cases we work with variables of bounded length,
such as balls of radius $r$ as defined by $\{\mathbf{x} \mid \mathbf{x} \in \mathbb{R}^d \text{ and } \|\mathbf{x}\| \leq r\}$.
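
The definition in (12.2.1) suggests a simple randomized test: sample pairs of points from a set and look for a point on the connecting segment that falls outside. The sketch below does this in two dimensions with the standard library. The helpers `looks_convex` and `sample_from`, and the ring-shaped counterexample, are illustrative assumptions rather than material from the text.

```python
import random

def looks_convex(contains, sample_point, trials=2000, seed=0):
    """Search for a segment that leaves the set; return False if one is found.

    A return value of True is only evidence of convexity, not a proof.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = sample_point(rng), sample_point(rng)
        lam = rng.random()
        mid = tuple(lam * ai + (1 - lam) * bi for ai, bi in zip(a, b))
        if not contains(mid):
            return False
    return True

def sample_from(contains):
    """Rejection sampler over [-1, 1]^2 for a set given by a membership test."""
    def sample(rng):
        while True:
            p = (rng.uniform(-1, 1), rng.uniform(-1, 1))
            if contains(p):
                return p
    return sample

norm2 = lambda p: (p[0] ** 2 + p[1] ** 2) ** 0.5
in_ball = lambda p: norm2(p) <= 1.0           # ball of radius 1
in_ring = lambda p: 0.5 <= norm2(p) <= 1.0    # ball with a hole: nonconvex

print(looks_convex(in_ball, sample_from(in_ball)))
print(looks_convex(in_ring, sample_from(in_ring)))
```

A single failing segment disproves convexity, whereas passing all trials merely fails to find a counterexample.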


Convex Functions 


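Now that we have convex sets we can introduce convex functions $f$. Given a convex set $\mathcal{X}$,
a function $f: \mathcal{X} \to \mathbb{R}$ is convex if for all $x, x' \in \mathcal{X}$ and for all $\lambda \in [0, 1]$ we have


$$\lambda f(x) + (1 - \lambda) f(x') \geq f(\lambda x + (1 - \lambda) x'). \tag{12.2.2}$$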


To illustrate this let’s plot a few functions and check which ones satisfy the requirement. 
Below we define a few functions, both convex and nonconvex. 


f = lambda x: 0.5 * x**2  # Convex
g = lambda x: torch.cos(np.pi * x)  # Nonconvex
h = lambda x: torch.exp(0.5 * x)  # Convex

x, segment = torch.arange(-2, 2, 0.01), torch.tensor([-1.5, 1])
d2l.use_svg_display()
_, axes = d2l.plt.subplots(1, 3, figsize=(9, 3))
for ax, func in zip(axes, [f, g, h]):
    d2l.plot([x, segment], [func(x), func(segment)], axes=ax)


As expected, the cosine function is nonconvex, whereas the parabola and the exponential
function are convex. Note that the requirement that $\mathcal{X}$ is a convex set is necessary for the
condition to make sense. Otherwise the outcome of $f(\lambda x + (1 - \lambda) x')$ might not be well defined.
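
The same conclusion can be reached without a plot by testing the inequality (12.2.2) directly on a grid. The following sketch uses plain Python versions of the three functions; the helper name `violates_convexity`, the grid resolution, and the tolerance are assumptions chosen for illustration.

```python
import math

def violates_convexity(func, lo=-2.0, hi=2.0, steps=41, tol=1e-9):
    """Return True if some grid triple (x, x', lambda) breaks (12.2.2)."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    lambdas = [j / 10 for j in range(11)]
    for x in grid:
        for x_prime in grid:
            for lam in lambdas:
                chord = lam * func(x) + (1 - lam) * func(x_prime)
                value = func(lam * x + (1 - lam) * x_prime)
                if chord < value - tol:
                    return True
    return False

f = lambda x: 0.5 * x ** 2           # Convex
g = lambda x: math.cos(math.pi * x)  # Nonconvex
h = lambda x: math.exp(0.5 * x)      # Convex

for name, func in [('f', f), ('g', g), ('h', h)]:
    print(name, 'violates (12.2.2):', violates_convexity(func))
```

Finding no violation on a finite grid is only evidence of convexity on the tested interval, while one violation is conclusive.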


Jensen’s Inequality 


Given a convex function f, one of the most useful mathematical tools is Jensen’s inequality. 
It amounts to a generalization of the definition of convexity: 


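$$\sum_i \alpha_i f(x_i) \geq f\left(\sum_i \alpha_i x_i\right) \text{ and } E_X[f(X)] \geq f\left(E_X[X]\right), \tag{12.2.3}$$


where $\alpha_i$ are nonnegative real numbers such that $\sum_i \alpha_i = 1$ and $X$ is a random variable. In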
other words, the expectation of a convex function is no less than the convex function of an 
expectation, where the latter is usually a simpler expression. To prove the first inequality 
we repeatedly apply the definition of convexity to one term in the sum at a time. 


One of the common applications of Jensen’s inequality is to bound a more complicated 
expression by a simpler one. For example, its application can be with regard to the log- 
likelihood of partially observed random variables. That is, we use 


$$E_{Y \sim P(Y)}[-\log P(X \mid Y)] \geq -\log P(X), \tag{12.2.4}$$


since $\int P(Y) P(X \mid Y) dY = P(X)$. This can be used in variational methods. Here $Y$
is typically the unobserved random variable, P(Y) is the best guess of how it might be 
distributed, and P(X) is the distribution with Y integrated out. For instance, in clustering 
Y might be the cluster labels and P(X | Y) is the generative model when applying cluster 
labels. 
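
Jensen's inequality is easy to observe on samples. The sketch below is an illustration under assumptions not taken from the text: it draws standard normal samples with a fixed seed and uses the exponential as the convex function. For the empirical distribution of the samples the inequality in (12.2.3) holds exactly, with weights $\alpha_i = 1/n$.

```python
import math
import random

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(10000)]

f = math.exp  # a convex function
mean_of_f = sum(f(x) for x in samples) / len(samples)   # estimate of E[f(X)]
f_of_mean = f(sum(samples) / len(samples))              # f(E[X]) for the sample

print(f'E[f(X)] ~ {mean_of_f:.3f}, f(E[X]) ~ {f_of_mean:.3f}')
```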


12.2.2 Properties 


Convex functions have many useful properties. We describe a few commonly-used ones
below.


Local Minima Are Global Minima


First and foremost, the local minima of convex functions are also the global minima. We 
can prove it by contradiction as follows. 


Consider a convex function $f$ defined on a convex set $\mathcal{X}$. Suppose that $x^* \in \mathcal{X}$ is a local
minimum: there exists a small positive value $p$ so that for $x \in \mathcal{X}$ that satisfies
$0 < |x - x^*| \leq p$ we have $f(x^*) < f(x)$.


Assume that the local minimum $x^*$ is not the global minimum of $f$: there exists $x' \in \mathcal{X}$
for which $f(x') < f(x^*)$. There also exists $\lambda \in [0, 1)$ such as $\lambda = 1 - \frac{p}{|x^* - x'|}$ so that
$0 < |\lambda x^* + (1 - \lambda) x' - x^*| \leq p$.


However, according to the definition of convex functions, we have

$$\begin{aligned}
f(\lambda x^* + (1 - \lambda) x') &\leq \lambda f(x^*) + (1 - \lambda) f(x') \\
&< \lambda f(x^*) + (1 - \lambda) f(x^*) \\
&= f(x^*),
\end{aligned} \tag{12.2.5}$$

which contradicts our statement that $x^*$ is a local minimum. Therefore, there does
not exist $x' \in \mathcal{X}$ for which $f(x') < f(x^*)$. The local minimum $x^*$ is also the global
minimum.


For instance, the convex function $f(x) = (x - 1)^2$ has a local minimum at $x = 1$, which is
also the global minimum. 


f = lambda x: (x - 1) ** 2
d2l.set_figsize()
d2l.plot([x, segment], [f(x), f(segment)], 'x', 'f(x)')


The fact that the local minima for convex functions are also the global minima is very 
convenient. It means that if we minimize functions we cannot “get stuck”. Note, though, 
that this does not mean that there cannot be more than one global minimum or that there 
might even exist one. For instance, the function $f(x) = \max(|x| - 1, 0)$ attains its minimum
value over the interval $[-1, 1]$. Conversely, the function $f(x) = \exp(x)$ does not attain a
minimum value on $\mathbb{R}$: for $x \to -\infty$ it asymptotes to 0, but there is no $x$ for which $f(x) = 0$.


Below Sets of Convex Functions Are Convex


We can conveniently define convex sets via below sets of convex functions. Concretely,
given a convex function $f$ defined on a convex set $\mathcal{X}$, any below set


$$\mathcal{S}_b \stackrel{\mathrm{def}}{=} \{x \mid x \in \mathcal{X} \text{ and } f(x) \leq b\} \tag{12.2.6}$$

is convex.


Let’s prove this quickly. Recall that for any $x, x' \in \mathcal{S}_b$ we need to show that
$\lambda x + (1 - \lambda) x' \in \mathcal{S}_b$ as long as $\lambda \in [0, 1]$. Since $f(x) \leq b$ and $f(x') \leq b$, by the
definition of convexity we have


$$f(\lambda x + (1 - \lambda) x') \leq \lambda f(x) + (1 - \lambda) f(x') \leq b. \tag{12.2.7}$$


Convexity and Second Derivatives 


Whenever the second derivative of a function $f: \mathbb{R}^n \to \mathbb{R}$ exists it is very easy to check
whether $f$ is convex. All we need to do is check whether the Hessian of $f$ is positive
semidefinite: $\nabla^2 f \succeq 0$, i.e., denoting the Hessian matrix $\nabla^2 f$ by $\mathbf{H}$, $\mathbf{x}^\top \mathbf{H} \mathbf{x} \geq 0$ for all
$\mathbf{x} \in \mathbb{R}^n$. For instance, the function $f(\mathbf{x}) = \frac{1}{2} \|\mathbf{x}\|^2$ is convex since $\nabla^2 f = \mathbf{1}$, i.e., its Hessian
is an identity matrix.


Formally, a twice-differentiable one-dimensional function $f: \mathbb{R} \to \mathbb{R}$ is convex if and only
if its second derivative $f'' \geq 0$. For any twice-differentiable multidimensional function
$f: \mathbb{R}^n \to \mathbb{R}$, it is convex if and only if its Hessian $\nabla^2 f \succeq 0$.
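
For one-dimensional functions the criterion $f'' \geq 0$ can be probed with finite differences. The sketch below applies a central second difference on a grid to the parabola, cosine, and exponential used earlier in this section. The helper names, step size, and tolerance are assumptions for illustration, and a passing grid check is evidence rather than proof.

```python
import math

def second_difference(func, x, eps=1e-3):
    """Central finite-difference estimate of the second derivative at x."""
    return (func(x + eps) + func(x - eps) - 2 * func(x)) / eps ** 2

def nonnegative_curvature(func, lo=-2.0, hi=2.0, steps=81, tol=1e-6):
    """Check f'' >= 0 (up to tol) on a grid; evidence of convexity on [lo, hi]."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return all(second_difference(func, x) >= -tol for x in grid)

print(nonnegative_curvature(lambda x: 0.5 * x ** 2))           # parabola
print(nonnegative_curvature(lambda x: math.cos(math.pi * x)))  # cosine
print(nonnegative_curvature(lambda x: math.exp(0.5 * x)))      # exponential
```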


First, we need to prove the one-dimensional case. To see that convexity of $f$ implies $f'' \geq 0$
we use the fact that


$$\frac{1}{2} f(x + \epsilon) + \frac{1}{2} f(x - \epsilon) \geq f\left(\frac{x + \epsilon}{2} + \frac{x - \epsilon}{2}\right) = f(x). \tag{12.2.8}$$

Since the second derivative is given by the limit over finite differences it follows that


$$f''(x) = \lim_{\epsilon \to 0} \frac{f(x + \epsilon) + f(x - \epsilon) - 2 f(x)}{\epsilon^2} \geq 0. \tag{12.2.9}$$

To see that $f'' \geq 0$ implies that $f$ is convex we use the fact that $f'' \geq 0$ implies that $f'$
is a monotonically nondecreasing function. Let $a < x < b$ be three points in $\mathbb{R}$, where
$x = (1 - \lambda) a + \lambda b$ and $\lambda \in (0, 1)$. According to the mean value theorem, there exist
$\alpha \in [a, x]$ and $\beta \in [x, b]$ such that


$$f'(\alpha) = \frac{f(x) - f(a)}{x - a} \text{ and } f'(\beta) = \frac{f(b) - f(x)}{b - x}. \tag{12.2.10}$$

By monotonicity $f'(\beta) \geq f'(\alpha)$, hence

$$\frac{x - a}{b - a} f(b) + \frac{b - x}{b - a} f(a) \geq f(x). \tag{12.2.11}$$


Since $x = (1 - \lambda) a + \lambda b$, we have


$$\lambda f(b) + (1 - \lambda) f(a) \geq f((1 - \lambda) a + \lambda b), \tag{12.2.12}$$

thus proving convexity.


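Second, we need a lemma before proving the multidimensional case: $f: \mathbb{R}^n \to \mathbb{R}$ is convex
if and only if for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$


$$g(z) \stackrel{\mathrm{def}}{=} f(z \mathbf{x} + (1 - z) \mathbf{y}) \text{ where } z \in [0, 1] \tag{12.2.13}$$

is convex.


To prove that convexity of $f$ implies that $g$ is convex, we can show that for all $a, b, \lambda \in [0, 1]$
(thus $0 \leq \lambda a + (1 - \lambda) b \leq 1$)


$$\begin{aligned}
& g(\lambda a + (1 - \lambda) b) \\
=& f\left((\lambda a + (1 - \lambda) b) \mathbf{x} + (1 - \lambda a - (1 - \lambda) b) \mathbf{y}\right) \\
=& f\left(\lambda (a \mathbf{x} + (1 - a) \mathbf{y}) + (1 - \lambda) (b \mathbf{x} + (1 - b) \mathbf{y})\right) \\
\leq& \lambda f(a \mathbf{x} + (1 - a) \mathbf{y}) + (1 - \lambda) f(b \mathbf{x} + (1 - b) \mathbf{y}) \\
=& \lambda g(a) + (1 - \lambda) g(b).
\end{aligned} \tag{12.2.14}$$

To prove the converse, we can show that for all $\lambda \in [0, 1]$

$$\begin{aligned}
& f(\lambda \mathbf{x} + (1 - \lambda) \mathbf{y}) \\
=& g(\lambda \cdot 1 + (1 - \lambda) \cdot 0) \\
\leq& \lambda g(1) + (1 - \lambda) g(0) \\
=& \lambda f(\mathbf{x}) + (1 - \lambda) f(\mathbf{y}).
\end{aligned} \tag{12.2.15}$$


Finally, using the lemma above and the result of the one-dimensional case, the
multidimensional case can be proven as follows. A multidimensional function $f: \mathbb{R}^n \to \mathbb{R}$ is convex
if and only if for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ $g(z) \stackrel{\mathrm{def}}{=} f(z \mathbf{x} + (1 - z) \mathbf{y})$, where $z \in [0, 1]$, is convex.
According to the one-dimensional case, this holds if and only if
$g'' = (\mathbf{x} - \mathbf{y})^\top \mathbf{H} (\mathbf{x} - \mathbf{y}) \geq 0$ ($\mathbf{H} \stackrel{\mathrm{def}}{=} \nabla^2 f$) for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$,
which is equivalent to $\mathbf{H} \succeq 0$ per the definition of positive semidefinite matrices.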


12.2.3 Constraints 


One of the nice properties of convex optimization is that it allows us to handle constraints
efficiently. That is, it allows us to solve constrained optimization problems of the form:


$$\begin{aligned}
\mathop{\mathrm{minimize}}_{\mathbf{x}} \quad & f(\mathbf{x}) \\
\text{subject to} \quad & c_i(\mathbf{x}) \leq 0 \text{ for all } i \in \{1, \ldots, n\},
\end{aligned} \tag{12.2.16}$$


where $f$ is the objective and the functions $c_i$ are constraint functions. To see what this does
consider the case where $c_1(\mathbf{x}) = \|\mathbf{x}\|_2 - 1$. In this case the parameters $\mathbf{x}$ are constrained to
the unit ball. If a second constraint is $c_2(\mathbf{x}) = \mathbf{v}^\top \mathbf{x} + b$, then this corresponds to all $\mathbf{x}$ lying
on a half-space. Satisfying both constraints simultaneously amounts to selecting a slice of
a ball.


Lagrangian


In general, solving a constrained optimization problem is difficult. One way of addressing 
it stems from physics with a rather simple intuition. Imagine a ball inside a box. The ball 
will roll to the place that is lowest and the forces of gravity will be balanced out with the 
forces that the sides of the box can impose on the ball. In short, the gradient of the objective 
function (i.e., gravity) will be offset by the gradient of the constraint function (the ball needs
to remain inside the box by virtue of the walls “pushing back”). Note that some constraints
may not be active: the walls that are not touched by the ball will not be able to exert any 
force on the ball. 


Skipping over the derivation of the Lagrangian $L$, the above reasoning can be expressed
via the following saddle point optimization problem:


$$L(\mathbf{x}, \alpha_1, \ldots, \alpha_n) = f(\mathbf{x}) + \sum_{i=1}^n \alpha_i c_i(\mathbf{x}) \text{ where } \alpha_i \geq 0. \tag{12.2.17}$$


Here the variables $\alpha_i$ ($i = 1, \ldots, n$) are the so-called Lagrange multipliers that ensure that
constraints are properly enforced. They are chosen just large enough to ensure that
$c_i(\mathbf{x}) \leq 0$ for all $i$. For instance, for any $\mathbf{x}$ where $c_i(\mathbf{x}) < 0$ naturally, we’d end up picking $\alpha_i = 0$.
Moreover, this is a saddle point optimization problem where one wants to maximize $L$ with
respect to all $\alpha_i$ and simultaneously minimize it with respect to $\mathbf{x}$. There is a rich body of
literature explaining how to arrive at the function $L(\mathbf{x}, \alpha_1, \ldots, \alpha_n)$. For our purposes it is
sufficient to know that the saddle point of $L$ is where the original constrained optimization
problem is solved optimally.
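
A tiny worked case may help. The sketch below is an illustration chosen for this purpose, not an example from the text: it minimizes $f(x) = x^2$ subject to $c(x) = 1 - x \leq 0$. For a fixed multiplier the Lagrangian is minimized at $x = \alpha/2$, and a grid search over $\alpha \geq 0$ then locates the saddle point.

```python
def lagrangian(x, alpha):
    """L(x, alpha) = f(x) + alpha * c(x) with f(x) = x^2 and c(x) = 1 - x."""
    return x ** 2 + alpha * (1 - x)

def argmin_x(alpha):
    """Minimizer of L over x for fixed alpha (set dL/dx = 2x - alpha to zero)."""
    return alpha / 2

# Maximize over alpha >= 0 the value of L at its minimizing x (grid search).
alphas = [i / 100 for i in range(0, 501)]
best_alpha = max(alphas, key=lambda a: lagrangian(argmin_x(a), a))
best_x = argmin_x(best_alpha)
print(f'alpha = {best_alpha:.2f}, x = {best_x:.2f}, c(x) = {1 - best_x:.2f}')
```

The saddle point lands on the boundary of the feasible set, where the constraint is active and the multiplier is strictly positive.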


Penalties 


One way of satisfying constrained optimization problems at least approximately is to adapt
the Lagrangian $L$. Rather than satisfying $c_i(\mathbf{x}) \leq 0$ we simply add $\alpha_i c_i(\mathbf{x})$ to the objective
function $f(\mathbf{x})$. This ensures that the constraints will not be violated too badly.


In fact, we have been using this trick all along. Consider weight decay in Section 3.7. In it
we add $\frac{\lambda}{2} \|\mathbf{w}\|^2$ to the objective function to ensure that $\mathbf{w}$ does not grow too large. From the
constrained optimization point of view we can see that this will ensure that $\|\mathbf{w}\|^2 - r^2 \leq 0$
for some radius $r$. Adjusting the value of $\lambda$ allows us to vary the size of $\mathbf{w}$.


In general, adding penalties is a good way of ensuring approximate constraint satisfaction. 
In practice this turns out to be much more robust than exact satisfaction. Furthermore, for 
nonconvex problems many of the properties that make the exact approach so appealing in 
the convex case (e.g., optimality) no longer hold. 
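
The effect of a penalty can be seen on a one-dimensional toy problem. The sketch below is an assumed example, not taken from the text: it runs gradient descent on $(w - 3)^2 + \frac{\lambda}{2} w^2$, whose minimizer is $w = 6 / (2 + \lambda)$, for a few penalty strengths. The step size and iteration count are arbitrary choices.

```python
def penalized_minimizer(lam, eta=0.05, steps=2000):
    """Gradient descent on (w - 3)^2 + (lam / 2) * w^2, starting from w = 0."""
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3) + lam * w
        w -= eta * grad
    return w

for lam in [0.0, 1.0, 10.0]:
    print(f'lambda = {lam:4.1f}: w = {penalized_minimizer(lam):.4f}')
```

Larger values of $\lambda$ pull the solution toward zero, which corresponds to a smaller radius $r$ in the constrained view.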


Projections 


An alternative strategy for satisfying constraints is projections. Again, we encountered
them before, e.g., when dealing with gradient clipping in Section 9.5. There we ensured
that a gradient has length bounded by $\theta$ via

$$\mathbf{g} \leftarrow \mathbf{g} \cdot \min(1, \theta / \|\mathbf{g}\|). \tag{12.2.18}$$


This turns out to be a projection of $\mathbf{g}$ onto the ball of radius $\theta$. More generally, a projection
on a convex set $\mathcal{X}$ is defined as


$$\mathrm{Proj}_{\mathcal{X}}(\mathbf{x}) = \mathop{\mathrm{argmin}}_{\mathbf{x}' \in \mathcal{X}} \|\mathbf{x} - \mathbf{x}'\|, \tag{12.2.19}$$


which is the closest point in $\mathcal{X}$ to $\mathbf{x}$.


Fig. 12.2.4: Convex projections.


The mathematical definition of projections may sound a bit abstract. Fig. 12.2.4 explains it
somewhat more clearly. In it we have two convex sets, a circle and a diamond. Points inside
both sets (yellow) remain unchanged during projections. Points outside both sets (black)
are projected to the points inside the sets (red) that are closest to the original points (black).
While for $\ell_2$ balls this leaves the direction unchanged, this need not be the case in general,
as can be seen in the case of the diamond.


One of the uses for convex projections is to compute sparse weight vectors. In this case we
project weight vectors onto an $\ell_1$ ball, which is a generalized version of the diamond case
in Fig. 12.2.4.
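
For the $\ell_2$ ball the projection has the closed form used in (12.2.18). The following sketch implements it for plain Python lists; the function name `project_l2_ball` is an assumption for illustration, and the harder $\ell_1$ case is not covered.

```python
import math

def project_l2_ball(x, radius):
    """Project the vector x (a list of floats) onto the l2 ball of given radius."""
    norm = math.sqrt(sum(xi ** 2 for xi in x))
    if norm <= radius:
        return list(x)              # points inside the ball stay unchanged
    scale = radius / norm           # same factor as min(1, theta / ||g||)
    return [scale * xi for xi in x]

g = [3.0, 4.0]                      # ||g|| = 5
print(project_l2_ball(g, 1.0))      # rescaled to length 1, same direction
print(project_l2_ball(g, 10.0))     # already inside, returned unchanged
```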


12.2.4 Summary 


In the context of deep learning the main purpose of convex functions is to motivate opti- 
mization algorithms and help us understand them in detail. In the following we will see 
how gradient descent and stochastic gradient descent can be derived accordingly. 


• Intersections of convex sets are convex. Unions need not be.


• The expectation of a convex function is no less than the convex function of an expectation
(Jensen’s inequality).


• A twice-differentiable function is convex if and only if its Hessian (a matrix of second
derivatives) is positive semidefinite.


• Convex constraints can be added via the Lagrangian. In practice we may simply add
them with a penalty to the objective function.


• Projections map to points in the convex set closest to the original points.


12.2.5 Exercises


1. Assume that we want to verify convexity of a set by drawing all lines between points
within the set and checking whether the lines are contained.

   1. Prove that it is sufficient to check only the points on the boundary.

   2. Prove that it is sufficient to check only the vertices of the set.


2. Denote by $\mathcal{B}_p[r] \stackrel{\mathrm{def}}{=} \{\mathbf{x} \mid \mathbf{x} \in \mathbb{R}^d \text{ and } \|\mathbf{x}\|_p \leq r\}$ the ball of radius $r$ using the $p$-norm.
Prove that $\mathcal{B}_p[r]$ is convex for all $p \geq 1$.


3. Given convex functions $f$ and $g$, show that $\max(f, g)$ is convex, too. Prove that $\min(f, g)$
is not convex.


4. Prove that the normalization of the softmax function is convex. More specifically prove
the convexity of $f(x) = \log \sum_i \exp(x_i)$.


5. Prove that linear subspaces, i.e., $\mathcal{X} = \{\mathbf{x} \mid \mathbf{W} \mathbf{x} = \mathbf{b}\}$, are convex sets.


6. Prove that in the case of linear subspaces with $\mathbf{b} = \mathbf{0}$ the projection $\mathrm{Proj}_{\mathcal{X}}$ can be written
as $\mathbf{M} \mathbf{x}$ for some matrix $\mathbf{M}$.


7. Show that for twice-differentiable convex functions $f$ we can write
$f(x + \epsilon) = f(x) + \epsilon f'(x) + \frac{1}{2} \epsilon^2 f''(x + \xi)$ for some $\xi \in [0, \epsilon]$.


8. Given a convex set $\mathcal{X}$ and two vectors $\mathbf{x}$ and $\mathbf{y}$, prove that projections never increase
distances, i.e., $\|\mathbf{x} - \mathbf{y}\| \geq \|\mathrm{Proj}_{\mathcal{X}}(\mathbf{x}) - \mathrm{Proj}_{\mathcal{X}}(\mathbf{y})\|$.


12.3 Gradient Descent


In this section we are going to introduce the basic concepts underlying gradient descent. 
Although it is rarely used directly in deep learning, an understanding of gradient descent is 
key to understanding stochastic gradient descent algorithms. For instance, the optimization 
problem might diverge due to an overly large learning rate. This phenomenon can already 
be seen in gradient descent. Likewise, preconditioning is a common technique in gradient 
descent and carries over to more advanced algorithms. Let’s start with a simple special 
case. 


12.3.1 One-Dimensional Gradient Descent 


Gradient descent in one dimension is an excellent example to explain why the gradient
descent algorithm may reduce the value of the objective function. Consider some
continuously differentiable real-valued function $f: \mathbb{R} \to \mathbb{R}$. Using a Taylor expansion we
obtain


$$f(x + \epsilon) = f(x) + \epsilon f'(x) + O(\epsilon^2). \tag{12.3.1}$$


That is, in first-order approximation $f(x + \epsilon)$ is given by the function value $f(x)$ and the
first derivative $f'(x)$ at $x$. It is not unreasonable to assume that for small $\epsilon$ moving in the
direction of the negative gradient will decrease $f$. To keep things simple we pick a fixed
step size $\eta > 0$ and choose $\epsilon = -\eta f'(x)$. Plugging this into the Taylor expansion above we
get


$$f(x - \eta f'(x)) = f(x) - \eta f'^2(x) + O(\eta^2 f'^2(x)). \tag{12.3.2}$$


If the derivative $f'(x) \neq 0$ does not vanish we make progress since $\eta f'^2(x) > 0$. Moreover,
we can always choose $\eta$ small enough for the higher-order terms to become irrelevant.
Hence we arrive at


$$f(x - \eta f'(x)) \lessapprox f(x). \tag{12.3.3}$$

This means that, if we use

$$x \leftarrow x - \eta f'(x) \tag{12.3.4}$$


to iterate $x$, the value of function $f(x)$ might decline. Therefore, in gradient descent we first
choose an initial value $x$ and a constant $\eta > 0$ and then use them to continuously iterate $x$
until the stop condition is reached, for example, when the magnitude of the gradient $|f'(x)|$
is small enough or the number of iterations has reached a certain value.
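
The update (12.3.4) together with both stop conditions fits in a few lines. The sketch below is a plain Python illustration; the function name `gradient_descent`, the tolerance, and the iteration cap are assumptions, not part of the book's code.

```python
def gradient_descent(f_grad, x0, eta, tol=1e-6, max_steps=1000):
    """Iterate x <- x - eta * f'(x) until |f'(x)| < tol or max_steps is hit."""
    x = x0
    for step in range(max_steps):
        grad = f_grad(x)
        if abs(grad) < tol:
            return x, step
        x -= eta * grad
    return x, max_steps

# Objective f(x) = x^2 with derivative 2x, started at x = 10.
x_final, steps_used = gradient_descent(lambda x: 2 * x, x0=10.0, eta=0.2)
print(f'stopped after {steps_used} steps at x = {x_final:.2e}')
```

Returning the step count alongside the iterate makes it easy to tell whether the loop ended because the gradient became small or because the iteration budget ran out.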


For simplicity we choose the objective function $f(x) = x^2$ to illustrate how to implement
gradient descent. Although we know that x = 0 is the solution to minimize f(x), we still 
use this simple function to observe how x changes. 


%matplotlib inline
import numpy as np
import torch
from d2l import torch as d2l


def f(x):  # Objective function
    return x ** 2


def f_grad(x):  # Gradient (derivative) of the objective function
    return 2 * x


Next, we use $x = 10$ as the initial value and assume $\eta = 0.2$. Using gradient descent to
iterate x for 10 times we can see that, eventually, the value of x approaches the optimal 
solution. 


def gd(eta, f_grad):
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x)
        results.append(float(x))
    print(f'epoch 10, x: {x:f}')
    return results


results = gd(0.2, f_grad)


epoch 10, x: 0.060466 


The progress of optimizing over x can be plotted as follows. 


def show_trace(results, f):
    n = max(abs(min(results)), abs(max(results)))
    f_line = torch.arange(-n, n, 0.01)
    d2l.set_figsize()
    d2l.plot([f_line, results], [[f(x) for x in f_line], [
        f(x) for x in results]], 'x', 'f(x)', fmts=['-', '-o'])


show_trace(results, f)


Learning Rate 


The learning rate $\eta$ can be set by the algorithm designer. If we use a learning rate that is too
small, it will cause $x$ to update very slowly, requiring more iterations to get a better
solution. To show what happens in such a case, consider the progress in the same optimization
problem for $\eta = 0.05$. As we can see, even after 10 steps we are still very far from the
optimal solution.


show_trace(gd(0.05, f_grad), f) 
epoch 10, x: 3.486784 


Conversely, if we use an excessively high learning rate, $|\eta f'(x)|$ might be too large for
the first-order Taylor expansion formula. That is, the term $O(\eta^2 f'^2(x))$ in (12.3.2) might
become significant. In this case, we cannot guarantee that the iteration of $x$ will be able to
lower the value of $f(x)$. For example, when we set the learning rate to $\eta = 1.1$, $x$ overshoots
the optimal solution $x = 0$ and gradually diverges.


show_trace(gd(1.1, f_grad), f) 


epoch 10, x: 61.917364 
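
For this particular objective the three outcomes can be explained in closed form. With $f(x) = x^2$ the update becomes $x \leftarrow (1 - 2\eta) x$, so after $k$ steps $x = (1 - 2\eta)^k x_0$. The sketch below, whose helper name `x_after` is an illustrative assumption, evaluates this formula for the three learning rates used above.

```python
def x_after(eta, steps=10, x0=10.0):
    """Closed form of gradient descent on f(x) = x^2: x_k = (1 - 2 * eta)^k * x0."""
    return (1 - 2 * eta) ** steps * x0

for eta in [0.2, 0.05, 1.1]:
    print(f'eta = {eta}: contraction factor {1 - 2 * eta:+.1f}, '
          f'x after 10 steps = {x_after(eta):f}')
```

The iterates shrink only when $|1 - 2\eta| < 1$. For $\eta = 1.1$ the factor is $-1.2$, so $x$ changes sign at every step and grows in magnitude.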


Local Minima 


To illustrate what happens for nonconvex functions consider the case of $f(x) = x \cdot \cos(cx)$
for some constant c. This function has infinitely many local minima. Depending on our 
choice of the learning rate and depending on how well conditioned the problem is, we may 
end up with one of many solutions. The example below illustrates how an (unrealistically) 
high learning rate will lead to a poor local minimum. 


c = torch.tensor(0.15 * np.pi)


def f(x):  # Objective function
    return x * torch.cos(c * x)


def f_grad(x):  # Gradient of the objective function
    return torch.cos(c * x) - c * x * torch.sin(c * x)


show_trace(gd(2, f_grad), f)


epoch 10, x: -1.528166


12.3.2 Multivariate Gradient Descent 


Now that we have a better intuition of the univariate case, let’s consider the situation where
$\mathbf{x} = [x_1, x_2, \ldots, x_d]^\top$. That is, the objective function $f: \mathbb{R}^d \to \mathbb{R}$ maps vectors into
scalars. Correspondingly its gradient is multivariate, too. It is a vector consisting of $d$
partial derivatives:


$$\nabla f(\mathbf{x}) = \bigg[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_d}\bigg]^\top. \tag{12.3.5}$$


Each partial derivative element $\partial f(\mathbf{x}) / \partial x_i$ in the gradient indicates the rate of change of
$f$ at $\mathbf{x}$ with respect to the input $x_i$. As before in the univariate case we can use the
corresponding Taylor approximation for multivariate functions to get some idea of what we
should do. In particular, we have that


$$f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + O(\|\boldsymbol{\epsilon}\|^2). \tag{12.3.6}$$


In other words, up to second-order terms in $\boldsymbol{\epsilon}$ the direction of steepest descent is given by the
negative gradient $-\nabla f(\mathbf{x})$. Choosing a suitable learning rate $\eta > 0$ yields the prototypical
gradient descent algorithm:


$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f(\mathbf{x}). \tag{12.3.7}$$


To see how the algorithm behaves in practice let’s construct an objective function
$f(\mathbf{x}) = x_1^2 + 2 x_2^2$ with a two-dimensional vector $\mathbf{x} = [x_1, x_2]^\top$ as input and a scalar as output. The
gradient is given by $\nabla f(\mathbf{x}) = [2 x_1, 4 x_2]^\top$. We will observe the trajectory of $\mathbf{x}$ by gradient
descent from the initial position $[-5, -2]$.


To begin with, we need two more helper functions. The first uses an update function and
applies it 20 times to the initial value. The second helper visualizes the trajectory of $\mathbf{x}$.


def train_2d(trainer, steps=20, f_grad=None):  #@save
    """Optimize a 2D objective function with a customized trainer."""
    # `s1` and `s2` are internal state variables that will be used in Momentum,
    # adagrad, RMSProp
    x1, x2, s1, s2 = -5, -2, 0, 0
    results = [(x1, x2)]
    for i in range(steps):
        if f_grad:
            x1, x2, s1, s2 = trainer(x1, x2, s1, s2, f_grad)
        else:
            x1, x2, s1, s2 = trainer(x1, x2, s1, s2)
        results.append((x1, x2))
    print(f'epoch {i + 1}, x1: {float(x1):f}, x2: {float(x2):f}')
    return results


def show_trace_2d(f, results):  #@save
    """Show the trace of 2D variables during optimization."""
    d2l.set_figsize()
    d2l.plt.plot(*zip(*results), '-o', color='#ff7f0e')
    x1, x2 = torch.meshgrid(torch.arange(-5.5, 1.0, 0.1),
                            torch.arange(-3.0, 1.0, 0.1), indexing='ij')
    d2l.plt.contour(x1, x2, f(x1, x2), colors='#1f77b4')
    d2l.plt.xlabel('x1')
    d2l.plt.ylabel('x2')


Next, we observe the trajectory of the optimization variable $\mathbf{x}$ for learning rate $\eta = 0.1$.
We can see that after 20 steps the value of x approaches its minimum at [0, 0]. Progress is 
fairly well-behaved albeit rather slow. 


def f_2d(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2


def f_2d_grad(x1, x2):  # Gradient of the objective function
    return (2 * x1, 4 * x2)


def gd_2d(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    return (x1 - eta * g1, x2 - eta * g2, 0, 0)


eta = 0.1
show_trace_2d(f_2d, train_2d(gd_2d, f_grad=f_2d_grad))


epoch 20, x1: -0.057646, x2: -0.000073
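
The printed values can be reproduced without any plotting. The sketch below is a standalone illustration with an assumed helper name, `gd_2d_trace`. It repeats the 20 updates in plain Python and compares them with the per-coordinate closed form: each coordinate is multiplied by its own factor per step, $1 - 2\eta$ for $x_1$ and $1 - 4\eta$ for $x_2$.

```python
def gd_2d_trace(eta, steps=20, start=(-5.0, -2.0)):
    """Gradient descent on f(x1, x2) = x1^2 + 2 * x2^2 without plotting."""
    x1, x2 = start
    for _ in range(steps):
        g1, g2 = 2 * x1, 4 * x2          # gradient [2 x1, 4 x2]
        x1, x2 = x1 - eta * g1, x2 - eta * g2
    return x1, x2

x1, x2 = gd_2d_trace(0.1)
print(f'x1: {x1:f}, x2: {x2:f}')
# Each coordinate shrinks by its own factor per step: (1 - 2 eta) and (1 - 4 eta).
print(f'closed form: {-5 * 0.8 ** 20:f}, {-2 * 0.6 ** 20:f}')
```

The flatter direction $x_1$ converges more slowly than $x_2$, while the steeper direction $x_2$ is the one that limits how large $\eta$ may be: its factor $1 - 4\eta$ leaves the interval $(-1, 1)$ once $\eta \geq 0.5$.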


12.3.3 Adaptive Methods


As we could see in Section 12.3.1, getting the learning rate $\eta$ “just right” is tricky. If we
pick it too small, we make little progress. If we pick it too large, the solution oscillates and
in the worst case it might even diverge. What if we could determine $\eta$ automatically or get
rid of having to select a learning rate at all? Second-order methods that look not only at the 
value and gradient of the objective function but also at its curvature can help in this case. 
While these methods cannot be applied to deep learning directly due to the computational 
cost, they provide useful intuition into how to design advanced optimization algorithms 
that mimic many of the desirable properties of the algorithms outlined below. 


Newton’s Method 


Reviewing the Taylor expansion of some function $f: \mathbb{R}^d \to \mathbb{R}$ there is no need to stop after
the first term. In fact, we can write it as

$$f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + \frac{1}{2} \boldsymbol{\epsilon}^\top \nabla^2 f(\mathbf{x}) \boldsymbol{\epsilon} + O(\|\boldsymbol{\epsilon}\|^3). \tag{12.3.8}$$

To avoid cumbersome notation we define $\mathbf{H} \stackrel{\mathrm{def}}{=} \nabla^2 f(\mathbf{x})$ to be the Hessian of $f$, which is
a $d \times d$ matrix. For small $d$ and simple problems $\mathbf{H}$ is easy to compute. For deep neural
networks, on the other hand, $\mathbf{H}$ may be prohibitively large, due to the cost of storing $O(d^2)$
entries. Furthermore it may be too expensive to compute via backpropagation. For now
let’s ignore such considerations and look at what algorithm we would get.


After all, the minimum of $f$ satisfies $\nabla f = 0$. Following calculus rules in Section 2.4.3, by
taking derivatives of (12.3.8) with regard to $\boldsymbol{\epsilon}$ and ignoring higher-order terms we arrive
at


$$\nabla f(\mathbf{x}) + \mathbf{H} \boldsymbol{\epsilon} = 0 \text{ and hence } \boldsymbol{\epsilon} = -\mathbf{H}^{-1} \nabla f(\mathbf{x}). \tag{12.3.9}$$

That is, we need to invert the Hessian $\mathbf{H}$ as part of the optimization problem.


As a simple example, for $f(x) = \frac{1}{2} x^2$ we have $\nabla f(x) = x$ and $\mathbf{H} = 1$. Hence for any $x$
we obtain $\epsilon = -x$. In other words, a single step is sufficient to converge perfectly without
the need for any adjustment! Alas, we got a bit lucky here: the Taylor expansion was exact
since $f(x + \epsilon) = \frac{1}{2} x^2 + \epsilon x + \frac{1}{2} \epsilon^2$.
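
The same one-step behavior carries over to the quadratic $f(\mathbf{x}) = x_1^2 + 2 x_2^2$ from Section 12.3.2, whose Hessian is the constant diagonal matrix $\mathrm{diag}(2, 4)$. The sketch below applies (12.3.9) once; restricting the helper `newton_step_diag` to diagonal Hessians is a simplifying assumption that avoids a general matrix inverse.

```python
def newton_step_diag(x, grad, hess_diag):
    """One Newton step x + eps with eps = -H^{-1} grad, for a diagonal Hessian."""
    return [xi - gi / hi for xi, gi, hi in zip(x, grad, hess_diag)]

# f(x1, x2) = x1^2 + 2 * x2^2: gradient [2 x1, 4 x2], Hessian diag(2, 4).
x = [-5.0, -2.0]
grad = [2 * x[0], 4 * x[1]]
x_new = newton_step_diag(x, grad, hess_diag=[2.0, 4.0])
print(x_new)
```

Dividing each gradient component by its curvature removes the mismatch between the two directions, so Newton's method lands on the minimum in a single step, whereas gradient descent with $\eta = 0.1$ was still approaching it after 20 steps.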

Let’s see what happens in other problems. Given a convex hyperbolic cosine function 
f(x) = cosh(cx) for some constant c, we can see that the global minimum at x = 0 is 
reached after a few iterations. 


c = torch.tensor(0.5)


def f(x):  # Objective function
    return torch.cosh(c * x)


def f_grad(x):  # Gradient of the objective function
    return c * torch.sinh(c * x)


def f_hess(x):  # Hessian of the objective function
    return c**2 * torch.cosh(c * x)


def newton(eta=1):
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x) / f_hess(x)
        results.append(float(x))
    print('epoch 10, x:', x)
    return results


show_trace(newton(), f)


epoch 10, x: tensor(0.)


Now let’s consider a nonconvex function, such as f(x) = xcos(cx) for some constant c. 
After all, note that in Newton’s method we end up dividing by the Hessian. This means that 
if the second derivative is negative we may walk into the direction of increasing the value 
of f. That is a fatal flaw of the algorithm. Let’s see what happens in practice. 


c = torch.tensor(0.15 * np.pi)

def f(x):  # Objective function
    return x * torch.cos(c * x)

def f_grad(x):  # Gradient of the objective function
    return torch.cos(c * x) - c * x * torch.sin(c * x)

def f_hess(x):  # Hessian of the objective function
    return - 2 * c * torch.sin(c * x) - x * c**2 * torch.cos(c * x)

show_trace(newton(), f)

epoch 10, x: tensor(26.8341)


This went spectacularly wrong. How can we fix it? One way would be to “fix” the Hessian
by taking its absolute value instead. Another strategy is to bring back the learning rate.
This seems to defeat the purpose, but not quite. Having second-order information allows
us to be cautious whenever the curvature is large and to take longer steps whenever the
objective function is flatter. Let’s see how this works with a slightly smaller learning rate,
say $\eta = 0.5$. As we can see, we have quite an efficient algorithm.


show_trace(newton(0.5), f)


epoch 10, x: tensor(7.2699) 
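The other remedy mentioned above, taking the absolute value of the Hessian, can be sketched as follows. This is an illustration rather than the implementation used in this section: it relies on the standard `math` module instead of `torch`, and the function name `newton_abs` and the safeguard `eps` are assumptions introduced here. Dividing by $|f''(x)|$ ensures that every step points in a descent direction, although it does not by itself guarantee convergence.

```python
import math

c = 0.15 * math.pi

def f(x):  # Objective function
    return x * math.cos(c * x)

def f_grad(x):  # Gradient of the objective function
    return math.cos(c * x) - c * x * math.sin(c * x)

def f_hess(x):  # Second derivative of the objective function
    return -2 * c * math.sin(c * x) - x * c**2 * math.cos(c * x)

def newton_abs(x=10.0, eta=1.0, steps=10, eps=1e-8):
    """Newton-style iteration that divides by |f''(x)| instead of f''(x)."""
    trace = [x]
    for _ in range(steps):
        # eps (an assumption) keeps the division defined if the curvature vanishes
        curvature = max(abs(f_hess(x)), eps)
        x -= eta * f_grad(x) / curvature
        trace.append(x)
    return trace

trace = newton_abs()
print(trace[-1], f(trace[-1]))
```

Steps can still be very long wherever the curvature is close to zero, so in practice this fix is often combined with a learning rate below 1.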




Convergence Analysis 


We only analyze the convergence rate of Newton’s method for some convex and three times
differentiable objective function $f$, where the second derivative is nonzero, i.e., $f'' > 0$.
The multivariate proof is a straightforward extension of the one-dimensional argument below
and omitted since it does not help us much in terms of intuition.


Denote by $x^{(k)}$ the value of $x$ at the $k^\textrm{th}$ iteration and let
$e^{(k)} \stackrel{\textrm{def}}{=} x^{(k)} - x^*$ be the distance from
optimality at the $k^\textrm{th}$ iteration. By Taylor expansion we have that the condition $f'(x^*) = 0$
can be written as

$$0 = f'(x^{(k)} - e^{(k)}) = f'(x^{(k)}) - e^{(k)} f''(x^{(k)}) + \frac{1}{2} (e^{(k)})^2 f'''(\xi^{(k)}), \tag{12.3.10}$$

which holds for some $\xi^{(k)} \in [x^{(k)} - e^{(k)}, x^{(k)}]$. Dividing the above expansion by $f''(x^{(k)})$
yields

$$\frac{f'(x^{(k)})}{f''(x^{(k)})} - e^{(k)} + \frac{1}{2} (e^{(k)})^2 \frac{f'''(\xi^{(k)})}{f''(x^{(k)})} = 0. \tag{12.3.11}$$

Recall that we have the update $x^{(k+1)} = x^{(k)} - f'(x^{(k)}) / f''(x^{(k)})$. Plugging in this update
equation and taking the absolute value of both sides, we have

$$\left|e^{(k+1)}\right| = \frac{1}{2} (e^{(k)})^2 \frac{\left|f'''(\xi^{(k)})\right|}{f''(x^{(k)})}. \tag{12.3.12}$$

Consequently, whenever we are in a region of bounded
$\left|f'''(\xi^{(k)})\right| / (2 f''(x^{(k)})) \leq c$, we
have a quadratically decreasing error

$$\left|e^{(k+1)}\right| \leq c (e^{(k)})^2. \tag{12.3.13}$$


As an aside, optimization researchers call this quadratic convergence, whereas a condition such
as $\left|e^{(k+1)}\right| \leq \alpha \left|e^{(k)}\right|$ for some $\alpha < 1$ would be called linear
convergence. Note that this analysis
comes with a number of caveats. First, we do not really have much of a guarantee when we
will reach the region of rapid convergence. Instead, we only know that once we reach it,
convergence will be very quick. Second, this analysis requires that $f$ is well-behaved up to
higher-order derivatives. It comes down to ensuring that $f$ does not have any “surprising”
properties in terms of how it might change its values.
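The quadratically decreasing error in (12.3.13) can be observed numerically. The sketch below uses an objective chosen for illustration only, $f(x) = e^x - x$, which is convex, has $f'' > 0$ everywhere, and attains its minimum at $x^* = 0$. For this choice $f'''(\xi) / (2 f''(x)) = e^{\xi - x} / 2 \leq 1/2$ whenever $\xi \leq x$, so the constant in the bound is $c = 1/2$ when starting from $x > 0$.

```python
import math

def f_grad(x):  # Derivative of the illustrative objective f(x) = exp(x) - x
    return math.expm1(x)

def f_hess(x):  # Second derivative, strictly positive everywhere
    return math.exp(x)

x_star = 0.0  # Minimizer of f
x = 1.0
errors = [abs(x - x_star)]
for _ in range(5):
    x -= f_grad(x) / f_hess(x)  # Newton update
    errors.append(abs(x - x_star))

# The ratio |e^(k+1)| / (e^(k))^2 should stay below c = 1/2
ratios = [e_next / e**2 for e, e_next in zip(errors, errors[1:])]
print(errors)
print(ratios)
```

Only five iterations are run because the error soon reaches the limits of floating-point precision, after which the ratios stop being meaningful.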


Preconditioning 


Quite unsurprisingly computing and storing the full Hessian is very expensive. It is thus 
desirable to find alternatives. One way to improve matters is preconditioning. It avoids 
computing the Hessian in its entirety but only computes the diagonal entries. This leads to 
update algorithms of the form 


$$\mathbf{x} \leftarrow \mathbf{x} - \eta \, \textrm{diag}(\mathbf{H})^{-1} \nabla f(\mathbf{x}). \tag{12.3.14}$$


While this is not quite as good as the full Newton’s method, it is still much better than not 
using it. To see why this might be a good idea consider a situation where one variable 
denotes height in millimeters and the other one denotes height in kilometers. Assuming 
that for both the natural scale is in meters, we have a terrible mismatch in parametrizations. 
Fortunately, using preconditioning removes this. Effectively preconditioning with gradient 
descent amounts to selecting a different learning rate for each variable (coordinate of vector 
x). As we will see later, preconditioning drives some of the innovation in stochastic gradient 
descent optimization algorithms. 
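To make the scale mismatch concrete, here is a small sketch comparing plain gradient descent with the diagonally preconditioned update (12.3.14). The objective $f(\mathbf{x}) = \frac{1}{2} \sum_i s_i x_i^2$, the values in `scales`, and all function names are assumptions made for illustration.

```python
def grad(x, scales):
    # Gradient of the illustrative objective f(x) = 0.5 * sum_i scales[i] * x[i]**2
    return [s * xi for s, xi in zip(scales, x)]

def hess_diag(x, scales):
    # Diagonal of the Hessian; for this objective it is constant
    return list(scales)

def descend(x, scales, eta, steps, precondition):
    for _ in range(steps):
        g = grad(x, scales)
        if precondition:
            # x <- x - eta * diag(H)^{-1} grad f(x)
            d = hess_diag(x, scales)
            g = [gi / di for gi, di in zip(g, d)]
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

scales = [1e-6, 1e6]  # Badly mismatched curvature, as in the km vs. mm example
x0 = [1.0, 1.0]
plain = descend(x0, scales, eta=1e-6, steps=100, precondition=False)
precond = descend(x0, scales, eta=0.5, steps=100, precondition=True)
print(plain, precond)
```

Without preconditioning the learning rate must be small enough for the steep coordinate, so the flat coordinate barely moves. With preconditioning both coordinates shrink at the same rate, which is the per-coordinate learning rate effect described above.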


Gradient Descent with Line Search 


One of the key problems in gradient descent is that we might overshoot the goal or make
insufficient progress. A simple fix for the problem is to use line search in conjunction with
gradient descent. That is, we use the direction given by $\nabla f(\mathbf{x})$ and then perform binary
search as to which learning rate $\eta$ minimizes $f(\mathbf{x} - \eta \nabla f(\mathbf{x}))$.

This algorithm converges rapidly (for an analysis and proof see e.g., Boyd and Vandenberghe
(2004)). However, for the purpose of deep learning this is not quite so feasible,
since each step of the line search would require us to evaluate the objective function on the
entire dataset. This is way too costly to accomplish.
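As a rough illustration of the idea, the sketch below combines gradient descent with a one-dimensional search over the learning rate. Note the substitution: it uses ternary search rather than binary search, since ternary search needs only function values. The objective, the search interval `[0, eta_max]`, and the function names are assumptions for this example.

```python
def f(x):
    # Illustrative convex objective with mismatched curvature
    return x[0] ** 2 + 10 * x[1] ** 2

def f_grad(x):
    return [2 * x[0], 20 * x[1]]

def line_search(x, g, eta_max=1.0, iters=50):
    """Ternary search for the eta in [0, eta_max] minimizing f(x - eta * g)."""
    phi = lambda eta: f([xi - eta * gi for xi, gi in zip(x, g)])
    lo, hi = 0.0, eta_max
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if phi(m1) < phi(m2):
            hi = m2  # The minimizer cannot lie in (m2, hi]
        else:
            lo = m1  # The minimizer cannot lie in [lo, m1)
    return (lo + hi) / 2

x = [5.0, 1.0]
history = [f(x)]
for _ in range(20):
    g = f_grad(x)
    eta = line_search(x, g)
    x = [xi - eta * gi for xi, gi in zip(x, g)]
    history.append(f(x))
print(history[-1])
```

Each call to `line_search` evaluates the objective about 100 times. On a toy function this is negligible, but it is exactly the cost that becomes prohibitive when every evaluation requires a pass over the full dataset.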


12.3.4 Summary 


• Learning rates matter. Too large and we diverge, too small and we do not make progress.
• Gradient descent can get stuck in local minima.
• In high dimensions adjusting the learning rate is complicated.
• Preconditioning can help with scale adjustment.
• Newton’s method is a lot faster once it has started working properly in convex problems.
• Beware of using Newton’s method without any adjustments for nonconvex problems.


12.3.5 Exercises 


1. Experiment with different learning rates and objective functions for gradient descent.
2. Implement line search to minimize a convex function in the interval $[a, b]$.
   1. Do you need derivatives for binary search, i.e., to decide whether to pick
      $[a, (a+b)/2]$ or $[(a+b)/2, b]$?
   2. How rapid is the rate of convergence for the algorithm?
   3. Implement the algorithm and apply it to minimizing $\log(\exp(x) + \exp(-2x - 3))$.
3. Design an objective function defined on $\mathbb{R}^2$ where gradient descent is exceedingly slow.
   Hint: scale different coordinates differently.
4. Implement the lightweight version of Newton’s method using preconditioning:
   1. Use diagonal Hessian as preconditioner.
   2. Use the absolute values of that rather than the actual (possibly signed) values.
   3. Apply this to the problem above.
5. Apply the algorithm above to a number of objective functions (convex or not). What
   happens if you rotate coordinates by 45 degrees?






12.4 Stochastic Gradient Descent 


In earlier chapters we kept using stochastic gradient descent in our training procedure
without explaining why it works. To shed some light on it, we just described the basic
principles of gradient descent in Section 12.3. In this section, we go on to discuss stochastic
gradient descent in greater detail.


%matplotlib inline
import math
import torch
from d2l import torch as d2l


12.4.1 Stochastic Gradient Updates 


In deep learning, the objective function is usually the average of the loss functions for each
example in the training dataset. Given a training dataset of $n$ examples, we assume that
$f_i(\mathbf{x})$ is the loss function with respect to the training example of index $i$, where $\mathbf{x}$ is the
parameter vector. Then we arrive at the objective function

$$f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n f_i(\mathbf{x}). \tag{12.4.1}$$

The gradient of the objective function at $\mathbf{x}$ is computed as

$$\nabla f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}). \tag{12.4.2}$$

If gradient descent is used, the computational cost for each independent variable iteration
is $\mathcal{O}(n)$, which grows linearly with $n$. Therefore, when the training dataset is larger, the
cost of gradient descent for each iteration will be higher.


Stochastic gradient descent (SGD) reduces computational cost at each iteration. At each
iteration of stochastic gradient descent, we uniformly sample an index $i \in \{1, \ldots, n\}$ for
data examples at random, and compute the gradient $\nabla f_i(\mathbf{x})$ to update $\mathbf{x}$:

$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x}), \tag{12.4.3}$$

where $\eta$ is the learning rate. We can see that the computational cost for each iteration drops
from $\mathcal{O}(n)$ of the gradient descent to the constant $\mathcal{O}(1)$. Moreover, we want to emphasize
that the stochastic gradient $\nabla f_i(\mathbf{x})$ is an unbiased estimate of the full gradient $\nabla f(\mathbf{x})$
because

$$\mathbb{E}_i \nabla f_i(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}) = \nabla f(\mathbf{x}). \tag{12.4.4}$$

This means that, on average, the stochastic gradient is a good estimate of the gradient.
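The unbiasedness property in (12.4.4) can be checked directly on a toy problem. In the sketch below the per-example losses $f_i(x) = \frac{1}{2}(x - a_i)^2$, the values in `data`, and the function names are assumptions chosen for illustration. The expectation over a uniformly drawn index is computed exactly and then approximated by sampling.

```python
import random

# Illustrative per-example losses f_i(x) = 0.5 * (x - a_i)^2 with gradient x - a_i
data = [1.0, 4.0, -2.0, 7.0]

def grad_i(x, i):
    return x - data[i]

def full_grad(x):
    return sum(grad_i(x, i) for i in range(len(data))) / len(data)

x = 3.0
n = len(data)

# Exact expectation over a uniformly drawn index i: each index has probability 1/n
expected_stochastic_grad = sum((1 / n) * grad_i(x, i) for i in range(n))

# Monte Carlo estimate of the same expectation
random.seed(0)
samples = [grad_i(x, random.randrange(n)) for _ in range(100_000)]
monte_carlo = sum(samples) / len(samples)

print(full_grad(x), expected_stochastic_grad, monte_carlo)
```

Any single sampled gradient can differ substantially from the full gradient. Only the average agrees, which is why individual stochastic updates are noisy.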


Now, we will compare it with gradient descent by adding random noise with a mean of 0
and a variance of 1 to the gradient to simulate a stochastic gradient descent.

def f(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2

def f_grad(x1, x2):  # Gradient of the objective function
    return 2 * x1, 4 * x2

def sgd(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    # Simulate noisy gradient
    g1 += torch.normal(0.0, 1, (1,)).item()
    g2 += torch.normal(0.0, 1, (1,)).item()
    eta_t = eta * lr()
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)

def constant_lr():
    return 1

eta = 0.1
lr = constant_lr  # Constant learning rate
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))

epoch 50, x1: 0.225517, x2: -0.076646


As we can see, the trajectory of the variables in the stochastic gradient descent is much
more noisy than the one we observed in gradient descent in Section 12.3. This is due to
the stochastic nature of the gradient. That is, even when we arrive near the minimum,
we are still subject to the uncertainty injected by the instantaneous gradient via $\eta \nabla f_i(\mathbf{x})$.
Even after 50 steps the quality is still not so good. Even worse, it will not improve after
additional steps (we encourage you to experiment with a larger number of steps to confirm
this). This leaves us with the only alternative: change the learning rate $\eta$. However, if we
pick this too small, we will not make any meaningful progress initially. On the other hand,
if we pick it too large, we will not get a good solution, as seen above. The only way to
resolve these conflicting goals is to reduce the learning rate dynamically as optimization
progresses.

This is also the reason for adding a learning rate function lr into the sgd step function. In
the example above any functionality for learning rate scheduling lies dormant as we set the
associated lr function to be constant.


12.4.2 Dynamic Learning Rate 


Replacing $\eta$ with a time-dependent learning rate $\eta(t)$ adds to the complexity of controlling
convergence of an optimization algorithm. In particular, we need to figure out how rapidly
$\eta$ should decay. If it is too quick, we will stop optimizing prematurely. If we decrease
it too slowly, we waste too much time on optimization. The following are a few basic
strategies that are used in adjusting $\eta$ over time (we will discuss more advanced strategies
later):

$$\begin{aligned}
\eta(t) & = \eta_i \textrm{ if } t_i \leq t \leq t_{i+1} && \textrm{piecewise constant} \\
\eta(t) & = \eta_0 \cdot e^{-\lambda t} && \textrm{exponential decay} \\
\eta(t) & = \eta_0 \cdot (\beta t + 1)^{-\alpha} && \textrm{polynomial decay}
\end{aligned} \tag{12.4.5}$$

In the first piecewise constant scenario we decrease the learning rate, e.g., whenever progress
in optimization stalls. This is a common strategy for training deep networks. Alternatively
we could decrease it much more aggressively by an exponential decay. Unfortunately this
often leads to premature stopping before the algorithm has converged. A popular choice is
polynomial decay with $\alpha = 0.5$. In the case of convex optimization there are a number of
proofs that show that this rate is well behaved.
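The three schedules in (12.4.5) can be written as plain functions of the step counter $t$. This is a minimal sketch: the function names, the default constants, and the `boundaries`/`rates` arguments are assumptions made for illustration, not an API from this book.

```python
import math

def piecewise_constant(t, boundaries, rates):
    """Return rates[i] for the segment of `boundaries` that contains step t."""
    for boundary, rate in zip(boundaries, rates):
        if t < boundary:
            return rate
    return rates[-1]

def exponential_decay(t, eta0=1.0, lam=0.1):
    return eta0 * math.exp(-lam * t)

def polynomial_decay(t, eta0=1.0, beta=0.1, alpha=0.5):
    return eta0 * (beta * t + 1) ** (-alpha)

for t in (0, 10, 100, 1000):
    print(t,
          piecewise_constant(t, boundaries=[50, 500], rates=[0.1, 0.01, 0.001]),
          exponential_decay(t),
          polynomial_decay(t))
```

Printing the values side by side shows how quickly the exponential schedule collapses towards zero compared with polynomial decay, which is the reason for the premature stopping mentioned above.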


Let’s see what the exponential decay looks like in practice. 


def exponential_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return math.exp(-0.1 * t)

t = 1
lr = exponential_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=1000, f_grad=f_grad))

epoch 1000, x1: -0.758829, x2: -0.115584


As expected, the variance in the parameters is significantly reduced. However, this comes
at the expense of failing to converge to the optimal solution $\mathbf{x} = (0, 0)$. Even after 1000
iteration steps we are still very far away from the optimal solution. Indeed, the algorithm
fails to converge at all. On the other hand, if we use a polynomial decay where the learning
rate decays with the inverse square root of the number of steps, convergence gets better
after only 50 steps.


def polynomial_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return (1 + 0.1 * t) ** (-0.5)

t = 1
lr = polynomial_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))

epoch 50, x1: 0.144834, x2: 0.041688


There exist many more choices for how to set the learning rate. For instance, we could start
with a small rate, then rapidly ramp up and then decrease it again, albeit more slowly. We
could even alternate between smaller and larger learning rates. There exists a large variety
of such schedules. For now let’s focus on learning rate schedules for which a comprehensive
theoretical analysis is possible, i.e., on learning rates in a convex setting. For general
nonconvex problems it is very difficult to obtain meaningful convergence guarantees, since
in general minimizing nonlinear nonconvex problems is NP-hard. For a survey see e.g., the
excellent lecture notes of Tibshirani (2015).


12.4.3 Convergence Analysis for Convex Objectives 


The following convergence analysis of stochastic gradient descent for convex objective 
functions is optional and primarily serves to convey more intuition about the problem. We 
limit ourselves to one of the simplest proofs (Nesterov and Vial, 2000). Significantly more 
advanced proof techniques exist, e.g., whenever the objective function is particularly well 
behaved. 


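Suppose that the objective function $f(\boldsymbol{\xi}, \mathbf{x})$ is convex in $\mathbf{x}$ for all $\boldsymbol{\xi}$. More concretely, we
consider the stochastic gradient descent update:

$$\mathbf{x}_{t+1} = \mathbf{x}_{t} - \eta_t \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}), \tag{12.4.6}$$

where $f(\boldsymbol{\xi}_t, \mathbf{x})$ is the objective function with respect to the training example $\boldsymbol{\xi}_t$ drawn from
some distribution at step $t$ and $\mathbf{x}$ is the model parameter. Denote by

$$R(\mathbf{x}) = E_{\boldsymbol{\xi}}[f(\boldsymbol{\xi}, \mathbf{x})] \tag{12.4.7}$$

the expected risk and by $R^*$ its minimum with regard to $\mathbf{x}$. Last let $\mathbf{x}^*$ be the minimizer
(we assume that it exists within the domain where $\mathbf{x}$ is defined). In this case we can track
the distance between the current parameter $\mathbf{x}_t$ at time $t$ and the risk minimizer $\mathbf{x}^*$ and see
whether it improves over time:

$$\begin{aligned}
& \|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 \\
= & \|\mathbf{x}_{t} - \eta_t \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}) - \mathbf{x}^*\|^2 \\
= & \|\mathbf{x}_{t} - \mathbf{x}^*\|^2 + \eta_t^2 \|\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})\|^2 - 2 \eta_t \left\langle \mathbf{x}_t - \mathbf{x}^*, \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}) \right\rangle.
\end{aligned} \tag{12.4.8}$$

We assume that the $\ell_2$ norm of stochastic gradient $\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})$ is bounded by some constant
$L$, hence we have that

$$\eta_t^2 \|\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})\|^2 \leq \eta_t^2 L^2. \tag{12.4.9}$$

We are mostly interested in how the distance between $\mathbf{x}_t$ and $\mathbf{x}^*$ changes in expectation.
In fact, for any specific sequence of steps the distance might well increase, depending on
whichever $\boldsymbol{\xi}_t$ we encounter. Hence we need to bound the dot product. Since for any convex
function $f$ it holds that $f(\mathbf{y}) \geq f(\mathbf{x}) + \langle f'(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle$ for all $\mathbf{x}$ and $\mathbf{y}$, by convexity we
have

$$f(\boldsymbol{\xi}_t, \mathbf{x}^*) \geq f(\boldsymbol{\xi}_t, \mathbf{x}_t) + \left\langle \mathbf{x}^* - \mathbf{x}_t, \partial_{\mathbf{x}} f(\boldsymbol{\xi}_t, \mathbf{x}_t) \right\rangle. \tag{12.4.10}$$

Plugging both inequalities (12.4.9) and (12.4.10) into (12.4.8) we obtain a bound on the
distance between parameters at time $t + 1$ as follows:

$$\|\mathbf{x}_{t} - \mathbf{x}^*\|^2 - \|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 \geq 2 \eta_t (f(\boldsymbol{\xi}_t, \mathbf{x}_t) - f(\boldsymbol{\xi}_t, \mathbf{x}^*)) - \eta_t^2 L^2. \tag{12.4.11}$$

This means that we make progress as long as the difference between current loss and the
optimal loss outweighs $\eta_t L^2 / 2$. Since this difference is bound to converge to zero it follows
that the learning rate $\eta_t$ also needs to vanish.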


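Next we take expectations over (12.4.11). This yields

$$E\left[\|\mathbf{x}_{t} - \mathbf{x}^*\|^2\right] - E\left[\|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2\right] \geq 2 \eta_t [E[R(\mathbf{x}_t)] - R^*] - \eta_t^2 L^2. \tag{12.4.12}$$

The last step involves summing over the inequalities for $t \in \{1, \ldots, T\}$. Since the sum
telescopes and by dropping the lower term we obtain

$$\|\mathbf{x}_1 - \mathbf{x}^*\|^2 \geq 2 \left(\sum_{t=1}^T \eta_t\right) [E[R(\mathbf{x}_t)] - R^*] - L^2 \sum_{t=1}^T \eta_t^2. \tag{12.4.13}$$

Note that we exploited that $\mathbf{x}_1$ is given and thus the expectation can be dropped. Last
define

$$\bar{\mathbf{x}} \stackrel{\textrm{def}}{=} \frac{\sum_{t=1}^T \eta_t \mathbf{x}_t}{\sum_{t=1}^T \eta_t}. \tag{12.4.14}$$

Since

$$E\left(\frac{\sum_{t=1}^T \eta_t R(\mathbf{x}_t)}{\sum_{t=1}^T \eta_t}\right) = \frac{\sum_{t=1}^T \eta_t E[R(\mathbf{x}_t)]}{\sum_{t=1}^T \eta_t} = E[R(\mathbf{x}_t)], \tag{12.4.15}$$

by Jensen’s inequality (setting $i = t$, $\alpha_i = \eta_t / \sum_{t=1}^T \eta_t$ in (12.2.3)) and convexity of $R$ it
follows that $E[R(\mathbf{x}_t)] \geq E[R(\bar{\mathbf{x}})]$, thus

$$\sum_{t=1}^T \eta_t E[R(\mathbf{x}_t)] \geq \sum_{t=1}^T \eta_t E\left[R(\bar{\mathbf{x}})\right]. \tag{12.4.16}$$

Plugging this into the inequality (12.4.13) yields the bound

$$E\left[R(\bar{\mathbf{x}})\right] - R^* \leq \frac{r^2 + L^2 \sum_{t=1}^T \eta_t^2}{2 \sum_{t=1}^T \eta_t}, \tag{12.4.17}$$

where $r^2 \stackrel{\textrm{def}}{=} \|\mathbf{x}_1 - \mathbf{x}^*\|^2$ is a bound on the distance between the initial choice of parameters
and the final outcome. In short, the speed of convergence depends on how the norm of
stochastic gradient is bounded ($L$) and how far away from optimality the initial parameter
value is ($r$). Note that the bound is in terms of $\bar{\mathbf{x}}$ rather than $\mathbf{x}_T$. This is the case since $\bar{\mathbf{x}}$ is
a smoothed version of the optimization path. Whenever $r$, $L$, and $T$ are known we can pick
the learning rate $\eta = r / (L \sqrt{T})$. This yields as upper bound $r L / \sqrt{T}$. That is, we converge
with rate $\mathcal{O}(1 / \sqrt{T})$ to the optimal solution.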
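The right-hand side of (12.4.17) is easy to evaluate for a given schedule. The sketch below does so for the constant learning rate $\eta = r / (L \sqrt{T})$ and, for comparison, for a schedule $\eta_t = \eta_0 / \sqrt{t}$. The numerical values of `r`, `L`, and `T`, the choice of `eta0`, and the helper name `sgd_bound` are assumptions made for illustration.

```python
import math

def sgd_bound(r, L, etas):
    """Right-hand side of (12.4.17): (r^2 + L^2 * sum eta_t^2) / (2 * sum eta_t)."""
    return (r**2 + L**2 * sum(e**2 for e in etas)) / (2 * sum(etas))

r, L, T = 2.0, 5.0, 10_000  # Illustrative values, not taken from the text

# Constant learning rate that requires knowing r, L, and T in advance
eta_opt = r / (L * math.sqrt(T))
bound_opt = sgd_bound(r, L, [eta_opt] * T)

# A decaying schedule eta_t = eta_0 / sqrt(t) that does not depend on T
eta0 = r / L
bound_poly = sgd_bound(r, L, [eta0 / math.sqrt(t) for t in range(1, T + 1)])

print(bound_opt, r * L / math.sqrt(T), bound_poly)
```

The first two printed numbers agree, confirming the value $r L / \sqrt{T}$. For these particular values the decaying schedule gives a somewhat larger bound, but it has the practical advantage that $T$ need not be fixed beforehand.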


12.4.4 Stochastic Gradients and Finite Samples 


So far we have played a bit fast and loose when it comes to talking about stochastic gradient
descent. We posited that we draw instances $x_i$, typically with labels $y_i$ from some
distribution $p(x, y)$ and that we use this to update the model parameters in some manner.
In particular, for a finite sample size we simply argued that the discrete distribution
$p(x, y) = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}(x) \delta_{y_i}(y)$ for some functions $\delta_{x_i}$ and $\delta_{y_i}$ allows us to perform
stochastic gradient descent over it.

However, this is not really what we did. In the toy examples in the current section we
simply added noise to an otherwise non-stochastic gradient, i.e., we pretended to have pairs
$(x_i, y_i)$. It turns out that this is justified here (see the exercises for a detailed discussion).
More troubling is that in all previous discussions we clearly did not do this. Instead we
iterated over all instances exactly once. To see why this is preferable consider the converse,
namely that we are sampling $n$ observations from the discrete distribution with replacement.
The probability of choosing an element $i$ at random is $1/n$. Thus the probability of choosing
it at least once is

$$P(\textrm{choose } i) = 1 - P(\textrm{omit } i) = 1 - (1 - 1/n)^n \approx 1 - e^{-1} \approx 0.63. \tag{12.4.18}$$

A similar reasoning shows that the probability of picking some sample (i.e., training example)
exactly once is given by

$${n \choose 1} \frac{1}{n} \left(1 - \frac{1}{n}\right)^{n-1} = \frac{n}{n-1} \left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.37. \tag{12.4.19}$$

Sampling with replacement leads to an increased variance and decreased data efficiency
relative to sampling without replacement. Hence, in practice we perform the latter (and
this is the default choice throughout this book). Last note that repeated passes through the
training dataset traverse it in a different random order.
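The two probabilities in (12.4.18) and (12.4.19) can be evaluated for finite $n$ to see how quickly they approach their limits. The function names below are illustrative.

```python
import math

def p_at_least_once(n):
    # Probability that a fixed example is drawn at least once in n draws
    return 1 - (1 - 1 / n) ** n

def p_exactly_once(n):
    # n possible positions for the single hit, each with probability 1/n,
    # and the remaining n - 1 draws must all miss
    return n * (1 / n) * (1 - 1 / n) ** (n - 1)

for n in (10, 100, 10_000):
    print(n, p_at_least_once(n), p_exactly_once(n))
print(1 - math.exp(-1), math.exp(-1))
```

Even for moderate $n$ the values are close to $1 - e^{-1}$ and $e^{-1}$, so roughly a third of the examples would be skipped in each pass if we sampled with replacement.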


12.4.5 Summary 


• For convex problems we can prove that for a wide choice of learning rates stochastic
  gradient descent will converge to the optimal solution.
• For deep learning this is generally not the case. However, the analysis of convex problems
  gives us useful insight into how to approach optimization, namely to reduce the
  learning rate progressively, albeit not too quickly.
• Problems occur when the learning rate is too small or too large. In practice a suitable
  learning rate is often found only after multiple experiments.
• When there are more examples in the training dataset, it costs more to compute each
  iteration for gradient descent, so stochastic gradient descent is preferred in these cases.
• Optimality guarantees for stochastic gradient descent are in general not available in
  nonconvex cases since the number of local minima that require checking might well be
  exponential.


12.4.6 Exercises 


1. Experiment with different learning rate schedules for stochastic gradient descent and
   with different numbers of iterations. In particular, plot the distance from the optimal
   solution $(0, 0)$ as a function of the number of iterations.
2. Prove that for the function $f(x_1, x_2) = x_1^2 + 2 x_2^2$ adding normal noise to the gradient is
   equivalent to minimizing a loss function $f(\mathbf{x}, \mathbf{w}) = (x_1 - w_1)^2 + 2 (x_2 - w_2)^2$ where $\mathbf{x}$
   is drawn from a normal distribution.
3. Compare convergence of stochastic gradient descent when you sample from $\{(x_1, y_1), \ldots\}$
   with replacement and when you sample without replacement.
4. How would you change the stochastic gradient descent solver if some gradient (or rather
   some coordinate associated with it) was consistently larger than all the other gradients?
5. Assume that $f(x) = x^2 (1 + \sin x)$. How many local minima does $f$ have? Can you
   change $f$ in such a way that to minimize it one needs to evaluate all the local minima?


12.5 Minibatch Stochastic Gradient Descent


So far we encountered two extremes in the approach to gradient-based learning: Section 
12.3 uses the full dataset to compute gradients and to update parameters, one pass at a time. 
Conversely Section 12.4 processes one training example at a time to make progress. Either 
of them has its own drawbacks. Gradient descent is not particularly data efficient whenever 
data is very similar. Stochastic gradient descent is not particularly computationally efficient 
since CPUs and GPUs cannot exploit the full power of vectorization. This suggests that 
there might be something in between, and in fact, that is what we have been using so far in 
the examples we discussed. 


12.5.1 Vectorization and Caches 


At the heart of the decision to use minibatches is computational efficiency. This is most 
easily understood when considering parallelization to multiple GPUs and multiple servers. 
In this case we need to send at least one image to each GPU. With 8 GPUs per server and 
16 servers we already arrive at a minibatch size no smaller than 128. 


Things are a bit more subtle when it comes to single GPUs or even CPUs. These devices 
have multiple types of memory, often multiple types of computational units and different 
bandwidth constraints between them. For instance, a CPU has a small number of registers 
and then the L1, L2, and in some cases even L3 cache (which is shared among different 
processor cores). These caches are of increasing size and latency (and at the same time 
they are of decreasing bandwidth). Suffice to say, the processor is capable of performing 
many more operations than what the main memory interface is able to provide. 


First, a 2GHz CPU with 16 cores and AVX-512 vectorization can process up to
$2 \cdot 10^9 \cdot 16 \cdot 32 \approx 10^{12}$ bytes per second. The capability of GPUs easily exceeds this number by a
factor of 100. On the other hand, a midrange server processor might not have much more
than 100 GB/s bandwidth, i.e., less than one tenth of what would be required to keep the
processor fed. To make matters worse, not all memory access is created equal: memory
interfaces are typically 64 bit wide or wider (e.g., on GPUs up to 384 bit), hence reading a
single byte incurs the cost of a much wider access.
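The arithmetic behind this comparison is short enough to spell out. The variable names below are illustrative, and the figure of 32 bytes per core per cycle is simply the one used in the estimate above.

```python
clock_hz = 2e9        # 2 GHz
cores = 16
bytes_per_cycle = 32  # Per-core vector throughput assumed in the estimate
compute_bytes_per_s = clock_hz * cores * bytes_per_cycle

memory_bytes_per_s = 100e9  # 100 GB/s main-memory bandwidth

print(f'processor can consume {compute_bytes_per_s:.3e} bytes/s')
print(f'memory supplies {memory_bytes_per_s / compute_bytes_per_s:.1%} of that')
```

The gap of roughly a factor of ten is what caches are meant to bridge.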


Second, there is significant overhead for the first access whereas sequential access is
relatively cheap (this is often called a burst read). There are many more things to keep in mind,
such as caching when we have multiple sockets, chiplets, and other structures. See this
Wikipedia article for a more in-depth discussion.


The way to alleviate these constraints is to use a hierarchy of CPU caches that are actually
fast enough to supply the processor with data. This is the driving force behind batching
in deep learning. To keep matters simple, consider matrix-matrix multiplication, say
$\mathbf{A} = \mathbf{B}\mathbf{C}$. We have a number of options for calculating $\mathbf{A}$. For instance, we could try the
following:

1. We could compute $\mathbf{A}_{ij} = \mathbf{B}_{i,:} \mathbf{C}_{:,j}$, i.e., we could compute it elementwise by means of
   dot products.
2. We could compute $\mathbf{A}_{:,j} = \mathbf{B} \mathbf{C}_{:,j}$, i.e., we could compute it one column at a time.
   Likewise we could compute $\mathbf{A}$ one row $\mathbf{A}_{i,:}$ at a time.
3. We could simply compute $\mathbf{A} = \mathbf{B} \mathbf{C}$.
4. We could break $\mathbf{B}$ and $\mathbf{C}$ into smaller block matrices and compute $\mathbf{A}$ one block at a
   time.

If we follow the first option, we will need to copy one row and one column vector into the
CPU each time we want to compute an element $\mathbf{A}_{ij}$. Even worse, due to the fact that matrix
elements are aligned sequentially we are thus required to access many disjoint locations
for one of the two vectors as we read them from memory. The second option is much
more favorable. In it, we are able to keep the column vector $\mathbf{C}_{:,j}$ in the CPU cache while
we keep on traversing through $\mathbf{B}$. This halves the memory bandwidth requirement with
correspondingly faster access. Of course, option 3 is most desirable. Unfortunately, most
matrices might not entirely fit into cache (this is what we are discussing after all). However,
option 4 offers a practically useful alternative: we can move blocks of the matrix into cache
and multiply them locally. Optimized libraries take care of this for us. Let’s have a look at
how efficient these operations are in practice.
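To show what option 4 means structurally, here is a deliberately simple sketch of blocked matrix multiplication on nested Python lists. It is meant to expose the loop structure only. The function name `matmul_blocked` and the block size are assumptions, and real libraries implement this idea in optimized low-level code rather than in Python.

```python
def matmul_blocked(B, C, block=2):
    """Compute A = B C one block at a time (option 4), using nested lists."""
    m, n, p = len(B), len(C), len(C[0])
    A = [[0.0] * p for _ in range(m)]
    for i0 in range(0, m, block):
        for j0 in range(0, p, block):
            for k0 in range(0, n, block):
                # Multiply the (i0, k0) block of B with the (k0, j0) block of C
                # and accumulate the result into the (i0, j0) block of A
                for i in range(i0, min(i0 + block, m)):
                    for j in range(j0, min(j0 + block, p)):
                        acc = 0.0
                        for k in range(k0, min(k0 + block, n)):
                            acc += B[i][k] * C[k][j]
                        A[i][j] += acc
    return A

B = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
C = [[1.0, 0.0], [0.0, 1.0], [2.0, 3.0]]
print(matmul_blocked(B, C))
```

The three outer loops select which blocks are in use, so that in a cache-aware implementation the three inner loops touch only data that fits into fast memory.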


Beyond computational efficiency, the overhead introduced by Python and by the deep learn- 
ing framework itself is considerable. Recall that each time we execute a command the 
Python interpreter sends a command to the MXNet engine which needs to insert it into 
the computational graph and deal with it during scheduling. Such overhead can be quite 
detrimental. In short, it is highly advisable to use vectorization (and matrices) whenever 
possible. 


%matplotlib inline 

import time 

import numpy as np 

import torch 

from torch import nn 

from d2l import torch as d2l


A = torch.zeros(256, 256) 
B = torch.randn(256, 256) 
C = torch.randn(256, 256) 


Since we will benchmark the running time frequently in the rest of the book, let’s define a 
timer. 


class Timer:  #@save
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        """Start the timer."""
        self.tik = time.time()

    def stop(self):
        """Stop the timer and record the time in a list."""
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        """Return the average time."""
        return sum(self.times) / len(self.times)

    def sum(self):
        """Return the sum of time."""
        return sum(self.times)

    def cumsum(self):
        """Return the accumulated time."""
        return np.array(self.times).cumsum().tolist()

timer = Timer()


Element-wise assignment simply iterates over all rows and columns of B and C respectively 
to assign the value to A. 


# Compute A = BC one element at a time
timer.start()
for i in range(256):
    for j in range(256):
        A[i, j] = torch.dot(B[i, :], C[:, j])
timer.stop()

1.7845737934112549


A faster strategy is to perform column-wise assignment. 


# Compute A = BC one column at a time
timer.start()
for j in range(256):
    A[:, j] = torch.mv(B, C[:, j])
timer.stop()

0.06541275978088379


Last, the most effective manner is to perform the entire operation in one block. Note that
multiplying any two matrices $\mathbf{B} \in \mathbb{R}^{m \times n}$ and $\mathbf{C} \in \mathbb{R}^{n \times p}$ takes approximately $2mnp$ floating
point operations, when scalar multiplication and addition are counted as separate operations
(fused in practice). Thus, multiplying two $256 \times 256$ matrices takes 0.03 billion floating
point operations. Let’s see what the respective speed of the operations is.

# Compute A = BC in one go
timer.start()
A = torch.mm(B, C)
timer.stop()

gigaflops = [0.03 / i for i in timer.times]
print(f'performance in Gigaflops: element {gigaflops[0]:.3f}, '
      f'column {gigaflops[1]:.3f}, full {gigaflops[2]:.3f}')

performance in Gigaflops: element 0.017, column 0.459, full 51.633


12.5.2 Minibatches 


In the past we took it for granted that we would read minibatches of data rather than single
observations to update parameters. We now give a brief justification for it. Processing single
observations requires us to perform many single matrix-vector (or even vector-vector)
multiplications, which is quite expensive and which incurs a significant overhead on behalf
of the underlying deep learning framework. This applies both to evaluating a network when
applied to data (often referred to as inference) and when computing gradients to update
parameters. That is, this applies whenever we perform $\mathbf{w} \leftarrow \mathbf{w} - \eta_t \mathbf{g}_t$ where

$$\mathbf{g}_t = \partial_{\mathbf{w}} f(\mathbf{x}_{t}, \mathbf{w}). \tag{12.5.1}$$

We can increase the computational efficiency of this operation by applying it to a minibatch
of observations at a time. That is, we replace the gradient $\mathbf{g}_t$ over a single observation by
one over a small batch

$$\mathbf{g}_t = \partial_{\mathbf{w}} \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} f(\mathbf{x}_{i}, \mathbf{w}). \tag{12.5.2}$$

Let’s see what this does to the statistical properties of $\mathbf{g}_t$: since both $\mathbf{x}_t$ and also all elements
of the minibatch $\mathcal{B}_t$ are drawn uniformly at random from the training set, the expectation of
the gradient remains unchanged. The variance, on the other hand, is reduced significantly.
Since the minibatch gradient is composed of $b \stackrel{\textrm{def}}{=} |\mathcal{B}_t|$ independent gradients which are
being averaged, its standard deviation is reduced by a factor of $b^{-\frac{1}{2}}$. This, by itself, is a good
thing, since it means that the updates are more reliably aligned with the full gradient.
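The $b^{-\frac{1}{2}}$ scaling can be checked with a quick simulation. In this sketch each per-example gradient is modeled as an independent draw from a standard normal distribution, which is an assumption made purely for illustration. The function name and the number of trials are likewise hypothetical.

```python
import random
import statistics

random.seed(0)

def minibatch_grad_std(b, trials=5_000):
    """Empirical std of the mean of b i.i.d. unit-variance 'gradients'."""
    means = [sum(random.gauss(0.0, 1.0) for _ in range(b)) / b
             for _ in range(trials)]
    return statistics.pstdev(means)

for b in (1, 4, 16, 64):
    # Second column: measured standard deviation; third column: b ** -0.5
    print(b, minibatch_grad_std(b), b ** -0.5)
```

Quadrupling the batch size only halves the noise while quadrupling the work per update, which is the diminishing return discussed next.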


Naively this would indicate that choosing a large minibatch $\mathcal{B}_t$ would be universally desirable.
Alas, after some point, the additional reduction in standard deviation is minimal when
compared to the linear increase in computational cost. In practice we pick a minibatch that
is large enough to offer good computational efficiency while still fitting into the memory of
a GPU. To illustrate the savings let’s have a look at some code. In it we perform the same
matrix-matrix multiplication, but this time broken up into “minibatches” of 64 columns at
a time.

timer.start()
for j in range(0, 256, 64):
    A[:, j:j+64] = torch.mm(B, C[:, j:j+64])
timer.stop()
print(f'performance in Gigaflops: block {0.03 / timer.times[3]:.3f}')

performance in Gigaflops: block 37.640


As we can see, the computation on the minibatch is essentially as efficient as on the full
matrix. A word of caution is in order. In Section 8.5 we used a type of regularization that
was heavily dependent on the amount of variance in a minibatch. As we increase the latter,
the variance decreases and with it the benefit of the noise-injection due to batch normalization.
See e.g., Ioffe (2017) for details on how to rescale and compute the appropriate
terms.


12.5.3 Reading the Dataset 


Let’s have a look at how minibatches are efficiently generated from data. In the following
we use a dataset developed by NASA to test the wing noise from different aircraft
to compare these optimization algorithms. For convenience we only use the first 1,500
examples. The data is whitened for preprocessing, i.e., we remove the mean and rescale the
variance to 1 per coordinate.


#@save
d2l.DATA_HUB['airfoil'] = (d2l.DATA_URL + 'airfoil_self_noise.dat',
                           '76e5be1548fd8222e5074cf0faae75edff8cf93f')

#@save
def get_data_ch11(batch_size=10, n=1500):
    data = np.genfromtxt(d2l.download('airfoil'),
                         dtype=np.float32, delimiter='\t')
    data = torch.from_numpy((data - data.mean(axis=0)) / data.std(axis=0))
    data_iter = d2l.load_array((data[:n, :-1], data[:n, -1]),
                               batch_size, is_train=True)
    return data_iter, data.shape[1]-1


12.5.4 Implementation from Scratch 


Recall the minibatch stochastic gradient descent implementation from Section 3.4. In the
following we provide a slightly more general implementation. For convenience it has the
same call signature as the other optimization algorithms introduced later in this chapter.
Specifically, we add the status input states and place the hyperparameter in dictionary
hyperparams. In addition, we will average the loss of each minibatch example in the training
function, so the gradient in the optimization algorithm does not need to be divided by
the batch size.

def sgd(params, states, hyperparams):
    for p in params:
        p.data.sub_(hyperparams['lr'] * p.grad)
        p.grad.data.zero_()


Next, we implement a generic training function to facilitate the use of the other optimization 
algorithms introduced later in this chapter. It initializes a linear regression model and can 
be used to train the model with minibatch stochastic gradient descent and other algorithms 
introduced subsequently. 


#@save
def train_ch11(trainer_fn, states, hyperparams, data_iter,
               feature_dim, num_epochs=2):
    # Initialization
    w = torch.normal(mean=0.0, std=0.01, size=(feature_dim, 1),
                     requires_grad=True)
    b = torch.zeros((1), requires_grad=True)
    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
    # Train
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[0, num_epochs], ylim=[0.22, 0.35])
    n, timer = 0, d2l.Timer()
    for _ in range(num_epochs):
        for X, y in data_iter:
            l = loss(net(X), y).mean()
            l.backward()
            trainer_fn([w, b], states, hyperparams)
            n += X.shape[0]
            if n % 200 == 0:
                timer.stop()
                animator.add(n/X.shape[0]/len(data_iter),
                             (d2l.evaluate_loss(net, data_iter, loss),))
                timer.start()
    print(f'loss: {animator.Y[0][-1]:.3f}, {timer.sum()/num_epochs:.3f} sec/epoch')
    return timer.cumsum(), animator.Y[0]


Let’s see how optimization proceeds for batch gradient descent. This can be achieved by 
setting the minibatch size to 1500 (i.e., to the total number of examples). As a result the 
model parameters are updated only once per epoch. There is little progress. In fact, after 6 
steps progress stalls. 


def train_sgd(lr, batch_size, num_epochs=2):
    data_iter, feature_dim = get_data_ch11(batch_size)
    return train_ch11(
        sgd, None, {'lr': lr}, data_iter, feature_dim, num_epochs)


gd_res = train_sgd(1, 1500, 10)


loss: 0.247, 0.020 sec/epoch 


When the batch size equals 1, we use stochastic gradient descent for optimization. For
simplicity of implementation we picked a constant (albeit small) learning rate. In stochastic
gradient descent, the model parameters are updated whenever an example is processed. In
our case this amounts to 1500 updates per epoch. As we can see, the decline in the value of
the objective function slows down after one epoch. Although both procedures processed
1500 examples within one epoch, stochastic gradient descent consumes more time than
gradient descent in our experiment. This is because stochastic gradient descent updates
the parameters more frequently and because it is less efficient to process single
observations one at a time.


sgd_res = train_sgd(0.005, 1)


loss: 0.245, 0.685 sec/epoch 


Finally, when the batch size equals 100, we use minibatch stochastic gradient descent for
optimization. The time required per epoch is far shorter than the time needed for stochastic
gradient descent and comparable to the time for batch gradient descent.


mini1_res = train_sgd(.4, 100)


loss: 0.246, 0.025 sec/epoch 


Reducing the batch size to 10, the time for each epoch increases because the workload for
each batch is less efficient to execute.


mini2_res = train_sgd(.05, 10)


loss: 0.246, 0.090 sec/epoch
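
The four runs differ mainly in how many parameter updates one pass over the 1500 examples triggers. A quick sketch of that count, assuming the data loader keeps a final partial batch (updates_per_epoch is a hypothetical helper, not a d2l function):

```python
import math

def updates_per_epoch(num_examples, batch_size):
    """Number of parameter updates in one pass over the data."""
    return math.ceil(num_examples / batch_size)

update_counts = {b: updates_per_epoch(1500, b) for b in (1500, 100, 10, 1)}
for batch_size, count in update_counts.items():
    print(f'batch size {batch_size:4d}: {count:4d} updates per epoch')
```

More updates per epoch mean more progress per example processed, but also more framework overhead per example, which is the trade-off visible in the timings above.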


Now we can compare the time vs. loss for the previous four experiments. As can be seen,
although stochastic gradient descent converges faster than GD in terms of number of
examples processed, it uses more time to reach the same loss than GD because computing the
gradient example by example is not as efficient. Minibatch stochastic gradient descent is
able to trade off convergence speed and computation efficiency. A minibatch size of 10 is
more efficient than stochastic gradient descent; a minibatch size of 100 even outperforms
GD in terms of runtime.


d2l.set_figsize([6, 3])
d2l.plot(*list(map(list, zip(gd_res, sgd_res, mini1_res, mini2_res))),
         'time (sec)', 'loss', xlim=[1e-2, 10],
         legend=['gd', 'sgd', 'batch size=100', 'batch size=10'])
d2l.plt.gca().set_xscale('log')


12.5.5 Concise Implementation 


In PyTorch, we can use the optimizer classes in torch.optim to call optimization algorithms.
This is used to implement a generic training function. We will use this throughout the
current chapter.


#@save
def train_concise_ch11(trainer_fn, hyperparams, data_iter, num_epochs=4):
    # Initialization
    net = nn.Sequential(nn.Linear(5, 1))
    def init_weights(module):
        if type(module) == nn.Linear:
            torch.nn.init.normal_(module.weight, std=0.01)
    net.apply(init_weights)

    optimizer = trainer_fn(net.parameters(), **hyperparams)
    loss = nn.MSELoss(reduction='none')
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[0, num_epochs], ylim=[0.22, 0.35])
    n, timer = 0, d2l.Timer()
    for _ in range(num_epochs):
        for X, y in data_iter:
            optimizer.zero_grad()
            out = net(X)
            y = y.reshape(out.shape)
            l = loss(out, y)
            l.mean().backward()
            optimizer.step()
            n += X.shape[0]
            if n % 200 == 0:
                timer.stop()
                # `MSELoss` computes squared error without the 1/2 factor
                animator.add(n/X.shape[0]/len(data_iter),
                             (d2l.evaluate_loss(net, data_iter, loss) / 2,))
                timer.start()
    print(f'loss: {animator.Y[0][-1]:.3f}, {timer.sum()/num_epochs:.3f} sec/epoch')


Using PyTorch to repeat the last experiment shows identical behavior.


data_iter, _ = get_data_ch11(10)
trainer = torch.optim.SGD
train_concise_ch11(trainer, {'lr': 0.01}, data_iter)


loss: 0.243, 0.096 sec/epoch


12.5.6 Summary 


• Vectorization makes code more efficient due to reduced overhead arising from the deep
  learning framework and due to better memory locality and caching on CPUs and
  GPUs.

• There is a trade-off between statistical efficiency arising from stochastic gradient descent
  and computational efficiency arising from processing large batches of data at a time.

• Minibatch stochastic gradient descent offers the best of both worlds: computational and
  statistical efficiency.

• In minibatch stochastic gradient descent we process batches of data obtained by a random
  permutation of the training data (i.e., each observation is processed only once per
  epoch, albeit in random order).

• It is advisable to decay the learning rates during training.

• In general, minibatch stochastic gradient descent is faster than stochastic gradient descent
  and gradient descent for convergence to a smaller risk, when measured in terms of
  clock time.


12.5.7 Exercises 


1. Modify the batch size and learning rate and observe the rate of decline for the value of 
the objective function and the time consumed in each epoch. 


2. Read the MXNet documentation and use the Trainer class set_learning_rate func- 
tion to reduce the learning rate of the minibatch stochastic gradient descent to 1/10 of 
its previous value after each epoch. 


3. Compare minibatch stochastic gradient descent with a variant that actually samples with 
replacement from the training set. What happens? 


4. An evil genie replicates your dataset without telling you (i.e., each observation occurs 
twice and your dataset grows to twice its original size, but nobody told you). How does 
the behavior of stochastic gradient descent, minibatch stochastic gradient descent and
that of gradient descent change? 


Discussions.


12.6 Momentum 


In Section 12.4 we reviewed what happens when performing stochastic gradient descent, 
i.e., when performing optimization where only a noisy variant of the gradient is available. 
In particular, we noticed that for noisy gradients we need to be extra cautious when it comes 
to choosing the learning rate in the face of noise. If we decrease it too rapidly, convergence 
stalls. If we are too lenient, we fail to converge to a good enough solution since noise keeps 
on driving us away from optimality. 


12.6.1 Basics 


In this section, we will explore more effective optimization algorithms, especially for cer- 
tain types of optimization problems that are common in practice. 


Leaky Averages 


The previous section saw us discussing minibatch SGD as a means for accelerating
computation. It also had the nice side-effect that averaging gradients reduced the amount of
variance. The minibatch stochastic gradient can be calculated by:

$$\mathbf{g}_{t, t-1} = \partial_{\mathbf{w}} \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} f(\mathbf{x}_{i}, \mathbf{w}_{t-1}) = \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \mathbf{h}_{i, t-1}. \tag{12.6.1}$$

To keep the notation simple, here we used $\mathbf{h}_{i, t-1} = \partial_{\mathbf{w}} f(\mathbf{x}_i, \mathbf{w}_{t-1})$
as the stochastic gradient for sample $i$ using the weights updated at time $t-1$. It would be
nice if we could benefit from the effect of variance reduction even beyond averaging gradients
on a minibatch. One option to accomplish this task is to replace the gradient computation by a
"leaky average":


$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \mathbf{g}_{t, t-1} \tag{12.6.2}$$

for some $\beta \in (0, 1)$. This effectively replaces the instantaneous gradient by one that
has been averaged over multiple past gradients. $\mathbf{v}$ is called velocity. It accumulates
past gradients similar to how a heavy ball rolling down the objective function landscape
integrates over past forces. To see what is happening in more detail let's expand
$\mathbf{v}_t$ recursively into

$$\mathbf{v}_t = \beta^2 \mathbf{v}_{t-2} + \beta \mathbf{g}_{t-1, t-2} + \mathbf{g}_{t, t-1} = \ldots = \sum_{\tau = 0}^{t-1} \beta^{\tau} \mathbf{g}_{t-\tau, t-\tau-1}. \tag{12.6.3}$$


Large $\beta$ amounts to a long-range average, whereas small $\beta$ amounts to only a slight
correction relative to a gradient method. The new gradient replacement no longer points in the
direction of steepest descent on a particular instance but rather in the direction
of a weighted average of past gradients. This allows us to realize most of the benefits of
averaging over a batch without the cost of actually computing the gradients on it. We will
revisit this averaging procedure in more detail later.
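
The equivalence between the recursive definition of the velocity and its expanded sum can be verified on a toy sequence. This is a minimal sketch with made-up scalar gradients, and the function names are hypothetical. It starts from zero velocity, so no initial term survives.

```python
def leaky_average(grads, beta):
    """Recursive form: v_t = beta * v_{t-1} + g_t, starting from v_0 = 0."""
    v = 0.0
    for g in grads:
        v = beta * v + g
    return v

def explicit_sum(grads, beta):
    """Expanded form: sum over tau of beta**tau times the gradient tau steps ago."""
    t = len(grads)
    return sum(beta ** tau * grads[t - 1 - tau] for tau in range(t))

grads = [1.0, -0.5, 2.0, 0.25, -1.5]  # made-up scalar gradients
beta = 0.9
v_recursive = leaky_average(grads, beta)
v_expanded = explicit_sum(grads, beta)
print(f'recursive: {v_recursive:.6f}, expanded: {v_expanded:.6f}')
```

Both forms agree up to floating-point rounding. The recursive one only needs to store a single extra value per parameter.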


The above reasoning formed the basis for what is now known as accelerated gradient meth- 
ods, such as gradients with momentum. They enjoy the additional benefit of being much 
more effective in cases where the optimization problem is ill-conditioned (i.e., where there 
are some directions where progress is much slower than in others, resembling a narrow 
canyon). Furthermore, they allow us to average over subsequent gradients to obtain more 
stable directions of descent. Indeed, the aspect of acceleration even for noise-free convex 
problems is one of the key reasons why momentum works and why it works so well. 


As one would expect, due to its efficacy momentum is a well-studied subject in optimization
for deep learning and beyond. See e.g., the beautiful expository article by Goh (2017)
for an in-depth analysis and interactive animation. It was proposed by Polyak (1964).
Nesterov (2018) has a detailed theoretical discussion in the context of convex optimization.
Momentum in deep learning has been known to be beneficial for a long time. See e.g., the
discussion by Sutskever et al. (2013) for details.


An Ill-conditioned Problem 


To get a better understanding of the geometric properties of the momentum method we
revisit gradient descent, albeit with a significantly less pleasant objective function. Recall
that in Section 12.3 we used $f(\mathbf{x}) = x_1^2 + 2 x_2^2$, i.e., a moderately distorted
ellipsoid objective. We distort this function further by stretching it out in the $x_1$
direction via

$$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2. \tag{12.6.4}$$

As before $f$ has its minimum at $(0, 0)$. This function is very flat in the direction of $x_1$.
Let's see what happens when we perform gradient descent as before on this new function. We
pick a learning rate of 0.4.


%matplotlib inline
import torch
from d2l import torch as d2l


eta = 0.4
def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2
def gd_2d(x1, x2, s1, s2):
    # Gradients of f_2d are 0.2 * x1 and 4 * x2; the state entries are unused
    return (x1 - eta * 0.2 * x1, x2 - eta * 4 * x2, 0, 0)


d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))


epoch 20, x1: -0.943467, x2: -0.000073 


By construction, the gradient in the $x_2$ direction is much higher and changes much more
rapidly than in the horizontal $x_1$ direction. Thus we are stuck between two undesirable
choices: if we pick a small learning rate we ensure that the solution does not diverge in
the $x_2$ direction but we are saddled with slow convergence in the $x_1$ direction. Conversely,
with a large learning rate we progress rapidly in the $x_1$ direction but diverge in $x_2$. The
example below illustrates what happens even after a slight increase in learning rate from
0.4 to 0.6. Convergence in the $x_1$ direction improves but the overall solution quality is much
worse.


eta = 0.6
d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))


epoch 20, x1: -0.387814, x2: -1673.365109


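The two gradient descent runs can be reproduced without the plotting helpers. The sketch below is a plain-Python stand-in, not the d2l implementation: it assumes 20 steps from the starting point (-5, -2), a choice that reproduces the printed results, and gd_trace is a hypothetical name.

```python
def gd_trace(eta, steps=20, x1=-5.0, x2=-2.0):
    """Plain gradient descent on f(x) = 0.1 * x1**2 + 2 * x2**2."""
    for _ in range(steps):
        # Gradients are 0.2 * x1 and 4 * x2
        x1, x2 = x1 - eta * 0.2 * x1, x2 - eta * 4 * x2
    return x1, x2

slow = gd_trace(0.4)       # converges in x2, crawls in x1
unstable = gd_trace(0.6)   # |1 - 0.6 * 4| = 1.4 > 1, so x2 blows up
print(f'eta = 0.4: x1 = {slow[0]:.6f}, x2 = {slow[1]:.6f}')
print(f'eta = 0.6: x1 = {unstable[0]:.6f}, x2 = {unstable[1]:.6f}')
```

Each coordinate is simply multiplied by a fixed factor per step (0.92 and -0.6 for the smaller learning rate, 0.88 and -1.4 for the larger one), which explains both the slow progress in $x_1$ and the divergence in $x_2$.

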
The Momentum Method 


The momentum method allows us to solve the gradient descent problem described above.
Looking at the optimization trace above we might intuit that averaging gradients over the
past would work well. After all, in the $x_1$ direction this will aggregate well-aligned
gradients, thus increasing the distance we cover with every step. Conversely, in the $x_2$
direction where gradients oscillate, an aggregate gradient will reduce step size due to
oscillations that cancel each other out. Using $\mathbf{v}_t$ instead of the gradient
$\mathbf{g}_t$ yields the following update equations:

$$\begin{aligned} \mathbf{v}_t &\leftarrow \beta \mathbf{v}_{t-1} + \mathbf{g}_{t, t-1}, \\ \mathbf{x}_t &\leftarrow \mathbf{x}_{t-1} - \eta_t \mathbf{v}_t. \end{aligned} \tag{12.6.5}$$

Note that for $\beta = 0$ we recover regular gradient descent. Before delving deeper into the
mathematical properties let's have a quick look at how the algorithm behaves in practice.


def momentum_2d(x1, x2, v1, v2):
    v1 = beta * v1 + 0.2 * x1
    v2 = beta * v2 + 4 * x2
    return x1 - eta * v1, x2 - eta * v2, v1, v2


eta, beta = 0.6, 0.5
d2l.show_trace_2d(f_2d, d2l.train_2d(momentum_2d))


epoch 20, x1: 0.007188, x2: 0.002553 


As we can see, even with the same learning rate that we used before, momentum still
converges well. Let's see what happens when we decrease the momentum parameter. Halving
it to $\beta = 0.25$ leads to a trajectory that barely converges at all. Nonetheless, it is a
lot better than without momentum (when the solution diverges).


eta, beta = 0.6, 0.25
d2l.show_trace_2d(f_2d, d2l.train_2d(momentum_2d))


epoch 20, x1: -0.126340, x2: -0.186632 
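
The momentum runs above can also be reproduced in plain Python. As before this is only a sketch: it assumes 20 steps from (-5, -2) with zero initial velocity, and momentum_trace is a hypothetical helper rather than a d2l function.

```python
def momentum_trace(eta, beta, steps=20, x1=-5.0, x2=-2.0):
    """Momentum on f(x) = 0.1 * x1**2 + 2 * x2**2, starting with zero velocity."""
    v1 = v2 = 0.0
    for _ in range(steps):
        v1 = beta * v1 + 0.2 * x1
        v2 = beta * v2 + 4 * x2
        x1, x2 = x1 - eta * v1, x2 - eta * v2
    return x1, x2

results = {beta: momentum_trace(0.6, beta) for beta in (0.5, 0.25, 0.0)}
for beta, (x1, x2) in results.items():
    print(f'beta = {beta:.2f}: x1 = {x1:.6f}, x2 = {x2:.6f}')
```

With the learning rate fixed at 0.6, the run with $\beta = 0.5$ ends close to the optimum, $\beta = 0.25$ converges only slowly, and $\beta = 0$ falls back to plain gradient descent and diverges in $x_2$.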


Note that we can combine momentum with stochastic gradient descent and in particular,
minibatch stochastic gradient descent. The only change is that in that case we replace the
gradients $\mathbf{g}_{t, t-1}$ with $\mathbf{g}_t$. Last, for convenience we initialize
$\mathbf{v}_0 = 0$ at time $t = 0$. Let's look at what leaky averaging actually does to the
updates.


Effective Sample Weight 


Recall that $\mathbf{v}_t = \sum_{\tau = 0}^{t-1} \beta^{\tau} \mathbf{g}_{t-\tau, t-\tau-1}$.
In the limit the terms add up to $\sum_{\tau=0}^\infty \beta^\tau = \frac{1}{1-\beta}$. In
other words, rather than taking a step of size $\eta$ in gradient descent or stochastic gradient
descent we take a step of size $\frac{\eta}{1-\beta}$ while at the same time, dealing with a
potentially much better behaved descent direction. These are two benefits in one. To illustrate
how weighting behaves for different choices of $\beta$ consider the diagram below.


d2l.set_figsize()
betas = [0.95, 0.9, 0.6, 0]
for beta in betas:
    x = torch.arange(40).detach().numpy()
    d2l.plt.plot(x, beta ** x, label=f'beta = {beta:.2f}')
d2l.plt.xlabel('time')
d2l.plt.legend();
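
The limit of the weights can be checked by summing them directly. A minimal sketch (the helper weight_sum is hypothetical) compares a truncated sum with the closed form $\frac{1}{1-\beta}$ for the same values of $\beta$ as in the plot:

```python
def weight_sum(beta, horizon):
    """Sum of the leaky-average weights beta**tau for tau = 0, ..., horizon - 1."""
    return sum(beta ** tau for tau in range(horizon))

sums = {}
for beta in (0.95, 0.9, 0.6, 0.0):
    sums[beta] = weight_sum(beta, 1000)
    limit = 1 / (1 - beta)
    print(f'beta = {beta:.2f}: truncated sum = {sums[beta]:.4f}, '
          f'1 / (1 - beta) = {limit:.4f}')
```

For $\beta = 0.9$ the weights add up to 10, so a learning rate of $\eta$ with momentum behaves roughly like a step of size $10 \eta$ along the averaged direction.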


12.6.2 Practical Experiments


Let’s see how momentum works in practice, i.e., when used within the context of a proper 
optimizer. For this we need a somewhat more scalable implementation. 


Implementation from Scratch 


Compared with (minibatch) stochastic gradient descent the momentum method needs to 
maintain a set of auxiliary variables, i.e., velocity. It has the same shape as the gradients 
(and variables of the optimization problem). In the implementation below we call these 
variables states. 


def init_momentum_states(feature_dim):
    v_w = torch.zeros((feature_dim, 1))
    v_b = torch.zeros(1)
    return (v_w, v_b)


def sgd_momentum(params, states, hyperparams):
    for p, v in zip(params, states):
        with torch.no_grad():
            v[:] = hyperparams['momentum'] * v + p.grad
            p[:] -= hyperparams['lr'] * v
        p.grad.data.zero_()


Let’s see how this works in practice. 


def train_momentum(lr, momentum, num_epochs=2):
    d2l.train_ch11(sgd_momentum, init_momentum_states(feature_dim),
                   {'lr': lr, 'momentum': momentum}, data_iter,
                   feature_dim, num_epochs)


data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
train_momentum(0.02, 0.5)


loss: 0.245, 0.153 sec/epoch 


When we increase the momentum hyperparameter momentum to 0.9, it amounts to a significantly
larger effective sample size of $\frac{1}{1 - 0.9} = 10$. We reduce the learning rate slightly
to 0.01 to keep matters under control.


train_momentum(0.01, 0.9)


loss: 0.248, 0.109 sec/epoch 


Reducing the learning rate further addresses any issue of non-smooth optimization problems.
Setting it to 0.005 yields good convergence properties.


train_momentum(0.005, 0.9)


loss: 0.243, 0.107 sec/epoch


Concise Implementation


There is very little to do in PyTorch since the standard SGD optimizer already has momentum
built in. Setting matching parameters yields a very similar trajectory.


trainer = torch.optim.SGD
d2l.train_concise_ch11(trainer, {'lr': 0.005, 'momentum': 0.9}, data_iter)


loss: 0.250, 0.108 sec/epoch


12.6.3 Theoretical Analysis 


So far the 2D example of $f(x) = 0.1 x_1^2 + 2 x_2^2$ seemed rather contrived. We will now see
that this is actually quite representative of the types of problem one might encounter, at
least in the case of minimizing convex quadratic objective functions.
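

Quadratic Convex Functions


Consider the function

$$h(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{x}^\top \mathbf{c} + b. \tag{12.6.6}$$

This is a general quadratic function. For positive definite matrices $\mathbf{Q} \succ 0$, i.e.,
for matrices with positive eigenvalues this has a minimizer at
$\mathbf{x}^* = -\mathbf{Q}^{-1} \mathbf{c}$ with minimum value
$b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}$. Hence we can rewrite $h$ as

$$h(\mathbf{x}) = \frac{1}{2} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c})^\top \mathbf{Q} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c}) + b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}. \tag{12.6.7}$$

The gradient is given by
$\partial_{\mathbf{x}} h(\mathbf{x}) = \mathbf{Q} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c}) = \mathbf{Q} (\mathbf{x} - \mathbf{x}^*)$.
That is, it is given by the distance between $\mathbf{x}$ and the minimizer, multiplied by
$\mathbf{Q}$. Consequently also the velocity is a linear combination of terms
$\mathbf{Q} (\mathbf{x}_t + \mathbf{Q}^{-1} \mathbf{c})$.


Since $\mathbf{Q}$ is positive definite it can be decomposed into its eigensystem via
$\mathbf{Q} = \mathbf{O}^\top \boldsymbol{\Lambda} \mathbf{O}$ for an orthogonal (rotation) matrix
$\mathbf{O}$ and a diagonal matrix $\boldsymbol{\Lambda}$ of positive eigenvalues. This allows us
to perform a change of variables from $\mathbf{x}$ to
$\mathbf{z} \stackrel{\textrm{def}}{=} \mathbf{O} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c})$ to
obtain a much simplified expression:

$$h(\mathbf{z}) = \frac{1}{2} \mathbf{z}^\top \boldsymbol{\Lambda} \mathbf{z} + b'. \tag{12.6.8}$$

Here $b' = b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}$. Since $\mathbf{O}$ is
only an orthogonal matrix this does not perturb the gradients in a meaningful way. Expressed in
terms of $\mathbf{z}$ gradient descent becomes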


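$$\mathbf{z}_t = \mathbf{z}_{t-1} - \boldsymbol{\Lambda} \mathbf{z}_{t-1} = (\mathbf{I} - \boldsymbol{\Lambda}) \mathbf{z}_{t-1}. \tag{12.6.9}$$

The important fact in this expression is that gradient descent does not mix between different
eigenspaces. That is, when expressed in terms of the eigensystem of $\mathbf{Q}$ the
optimization problem proceeds in a coordinate-wise manner. This also holds for

$$\begin{aligned} \mathbf{v}_t & = \beta \mathbf{v}_{t-1} + \boldsymbol{\Lambda} \mathbf{z}_{t-1} \\ \mathbf{z}_t & = \mathbf{z}_{t-1} - \eta \left(\beta \mathbf{v}_{t-1} + \boldsymbol{\Lambda} \mathbf{z}_{t-1}\right) \\ & = (\mathbf{I} - \eta \boldsymbol{\Lambda}) \mathbf{z}_{t-1} - \eta \beta \mathbf{v}_{t-1}. \end{aligned} \tag{12.6.10}$$

In doing this we just proved the following theorem: gradient descent with and without
momentum for a convex quadratic function decomposes into coordinate-wise optimization
in the direction of the eigenvectors of the quadratic matrix.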
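
The decoupling can be checked numerically on a rotated 2D quadratic. The sketch below is illustrative only and rests on arbitrary choices: the eigenvalues 0.2 and 4, a rotation angle of 0.7, $\mathbf{c} = 0$ (so the minimizer is the origin), and a learning rate of 0.4. It runs gradient descent in the original coordinates and compares the rotated result with the per-eigenvalue closed form.

```python
import math

theta = 0.7            # arbitrary rotation angle (an assumption for this sketch)
lam = (0.2, 4.0)       # eigenvalues of Q
eta, steps = 0.4, 10
cos_t, sin_t = math.cos(theta), math.sin(theta)
O = ((cos_t, sin_t), (-sin_t, cos_t))   # orthogonal (rotation) matrix

def matvec(M, v):
    return tuple(sum(M[i][j] * v[j] for j in range(2)) for i in range(2))

def transpose(M):
    return tuple(zip(*M))

def q_times(x):
    """Compute Q x for Q = O^T Lambda O without forming Q explicitly."""
    z = matvec(O, x)
    z = (lam[0] * z[0], lam[1] * z[1])
    return matvec(transpose(O), z)

# Gradient descent in the original coordinates: x <- x - eta * Q x
x0 = (-5.0, -2.0)
x = x0
for _ in range(steps):
    qx = q_times(x)
    x = (x[0] - eta * qx[0], x[1] - eta * qx[1])

# Coordinate-wise prediction in the eigenbasis: z_i <- (1 - eta * lam_i) * z_i
z0 = matvec(O, x0)
z_closed = tuple((1 - eta * lam[i]) ** steps * z0[i] for i in range(2))
z_from_x = matvec(O, x)
print('rotated iterate :', z_from_x)
print('per-eigenvalue  :', z_closed)
```

Both results agree up to rounding, so each eigendirection evolves as an independent scalar problem, which motivates the scalar analysis that follows.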


Scalar Functions

Given the above result let's see what happens when we minimize the function
$f(x) = \frac{\lambda}{2} x^2$. For gradient descent we have

$$x_{t+1} = x_t - \eta \lambda x_t = (1 - \eta \lambda) x_t. \tag{12.6.11}$$

Whenever $|1 - \eta \lambda| < 1$ this optimization converges at an exponential rate since after
$t$ steps we have $x_t = (1 - \eta \lambda)^t x_0$. This shows how the rate of convergence
improves initially as we increase the learning rate $\eta$ until $\eta \lambda = 1$. Beyond
that convergence slows down again and for $\eta \lambda > 2$ the optimization problem diverges.


lambdas = [0.1, 1, 10, 19]
eta = 0.1
d2l.set_figsize((6, 4))
for lam in lambdas:
    t = torch.arange(20).detach().numpy()
    d2l.plt.plot(t, (1 - eta * lam) ** t, label=f'lambda = {lam:.2f}')
d2l.plt.xlabel('time')
d2l.plt.legend();


To analyze convergence in the case of momentum we begin by rewriting the update equations
in terms of two scalars: one for $x$ and one for velocity $v$. This yields:

$$\begin{bmatrix} v_{t+1} \\ x_{t+1} \end{bmatrix} = \begin{bmatrix} \beta & \lambda \\ -\eta \beta & (1 - \eta \lambda) \end{bmatrix} \begin{bmatrix} v_{t} \\ x_{t} \end{bmatrix} = \mathbf{R}(\beta, \eta, \lambda) \begin{bmatrix} v_{t} \\ x_{t} \end{bmatrix}. \tag{12.6.12}$$

We used $\mathbf{R}$ to denote the $2 \times 2$ matrix governing convergence behavior. After $t$
steps the initial choice $[v_0, x_0]$ becomes $\mathbf{R}(\beta, \eta, \lambda)^t [v_0, x_0]$.
Hence, it is up to the eigenvalues of $\mathbf{R}$ to determine the speed of convergence. See
the Distill post of Goh (2017) for a great animation and Flammarion and Bach (2015) for a
detailed analysis. One can show that for $0 < \eta \lambda < 2 + 2 \beta$ velocity converges.
This is a larger range of feasible parameters when compared to $0 < \eta \lambda < 2$ for
gradient descent. It also suggests that in general large values of $\beta$ are desirable.
Further details require a fair amount of technical detail and we suggest that the interested
reader consult the original publications.
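
The stability range can be probed by computing the eigenvalues of the $2 \times 2$ matrix $\mathbf{R}$ directly. In this sketch (spectral_radius is a hypothetical helper) the trace is $1 + \beta - \eta \lambda$ and the determinant works out to $\beta$, so the eigenvalues follow from the quadratic formula. The iteration converges exactly when the largest eigenvalue magnitude is below 1.

```python
import cmath

def spectral_radius(beta, eta, lam):
    """Largest eigenvalue magnitude of R = [[beta, lam], [-eta*beta, 1 - eta*lam]]."""
    trace = beta + 1 - eta * lam
    det = beta  # beta * (1 - eta * lam) + eta * beta * lam simplifies to beta
    root = cmath.sqrt(trace ** 2 - 4 * det)
    return max(abs((trace + root) / 2), abs((trace - root) / 2))

# eta * lam = 2.4 lies beyond the gradient descent limit of 2
rho_gd = spectral_radius(0.0, 0.6, 4)         # no momentum
rho_momentum = spectral_radius(0.5, 0.6, 4)   # 2.4 < 2 + 2 * 0.5
rho_too_large = spectral_radius(0.5, 0.8, 4)  # 3.2 > 2 + 2 * 0.5
print(f'beta = 0.0, eta*lam = 2.4: {rho_gd:.4f}')
print(f'beta = 0.5, eta*lam = 2.4: {rho_momentum:.4f}')
print(f'beta = 0.5, eta*lam = 3.2: {rho_too_large:.4f}')
```

The first and last settings have a spectral radius above 1 and diverge, whereas the middle one contracts by a factor of about $\sqrt{0.5}$ per step. This matches the 2D experiments with a learning rate of 0.6.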


12.6.4 Summary 


• Momentum replaces gradients with a leaky average over past gradients. This accelerates
  convergence significantly.

• It is desirable for both noise-free gradient descent and (noisy) stochastic gradient descent.

• Momentum prevents stalling of the optimization process that is much more likely to
  occur for stochastic gradient descent.

• The effective number of gradients is given by $\frac{1}{1-\beta}$ due to exponentiated
  downweighting of past data.

• In the case of convex quadratic problems this can be analyzed explicitly in detail.

• Implementation is quite straightforward but it requires us to store an additional state
  vector (velocity $\mathbf{v}$).


12.6.5 Exercises 


1. Use other combinations of momentum hyperparameters and learning rates and observe 
and analyze the different experimental results. 


2. Try out gradient descent and momentum for a quadratic problem where you have multiple
eigenvalues, i.e., $f(x) = \frac{1}{2} \sum_i \lambda_i x_i^2$, e.g., $\lambda_i = 2^{-i}$.
Plot how the values of $x$ decrease for the initialization $x_i = 1$.


3. Derive minimum value and minimizer for
$h(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{x}^\top \mathbf{c} + b$.


4. What changes when we perform stochastic gradient descent with momentum? What
happens when we use minibatch stochastic gradient descent with momentum? Experiment
with the parameters.


12.7 Adagrad


Let’s begin by considering learning problems with features that occur infrequently. 


12.7.1 Sparse Features and Learning Rates 


Imagine that we are training a language model. To get good accuracy we typically want
to decrease the learning rate as we keep on training, usually at a rate of
$\mathcal{O}(t^{-\frac{1}{2}})$ or slower. Now consider a model training on sparse features,
i.e., features that occur only infrequently. This is common for natural language, e.g., it is
a lot less likely that we will see the word preconditioning than learning. However, it is also
common in other areas such as computational advertising and personalized collaborative
filtering. After all, there are many things that are of interest only for a small number of
people.


Parameters associated with infrequent features only receive meaningful updates whenever 
these features occur. Given a decreasing learning rate we might end up in a situation 
where the parameters for common features converge rather quickly to their optimal values, 
whereas for infrequent features we are still short of observing them sufficiently frequently 
before their optimal values can be determined. In other words, the learning rate either 
decreases too slowly for frequent features or too quickly for infrequent ones. 


A possible hack to redress this issue would be to count the number of times we see a
particular feature and to use this as a clock for adjusting learning rates. That is, rather than
choosing a learning rate of the form $\eta = \frac{\eta_0}{\sqrt{t + c}}$ we could use
$\eta_i = \frac{\eta_0}{\sqrt{s(i, t) + c}}$. Here $s(i, t)$ counts the number of nonzeros for
feature $i$ that we have observed up to time $t$. This is actually quite easy to implement at no
meaningful overhead. However, it fails whenever we do not quite have sparsity but rather just
data where the gradients are often very small and only rarely large. After all, it is unclear
where one would draw the line between something that qualifies as an observed feature or not.
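
The counting heuristic is easy to sketch. The example below uses a made-up toy dataset in which the first feature is present in every example and the second in only one. The values $\eta_0 = 0.1$ and $c = 1$ are arbitrary, and counter_learning_rates is a hypothetical helper.

```python
import math

def counter_learning_rates(feature_rows, eta0=0.1, c=1.0):
    """Per-feature learning rates eta0 / sqrt(s + c), where s counts nonzeros."""
    counts = [0] * len(feature_rows[0])
    for row in feature_rows:
        for i, value in enumerate(row):
            if value != 0:
                counts[i] += 1
    rates = [eta0 / math.sqrt(s + c) for s in counts]
    return counts, rates

# Made-up data: feature 0 is dense, feature 1 is seen only once
rows = [[1.0, 0.0]] * 7 + [[1.0, 3.0]]
counts, rates = counter_learning_rates(rows)
for i, (s, rate) in enumerate(zip(counts, rates)):
    print(f'feature {i}: seen {s} times, learning rate {rate:.4f}')
```

The rarely observed feature keeps a larger learning rate, which is the desired behavior. The weakness described above also shows up: a tiny but nonzero value counts just as much as a large one.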


Adagrad by Duchi et al. (2011) addresses this by replacing the rather crude counter $s(i, t)$
by an aggregate of the squares of previously observed gradients. In particular, it uses
$s(i, t+1) = s(i, t) + \left(\partial_i f(\mathbf{x})\right)^2$ as a means to adjust the learning
rate. This has two benefits: first, we no longer need to decide just when a gradient is large
enough. Second, it scales automatically with the magnitude of the gradients. Coordinates that
routinely correspond to large gradients are scaled down significantly, whereas others with
small gradients receive a much more gentle treatment. In practice this leads to a very
effective optimization procedure for computational advertising and related problems. But this
hides some of the additional benefits inherent in Adagrad that are best understood in the
context of preconditioning.


12.7.2 Preconditioning 


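Convex optimization problems are good for analyzing the characteristics of algorithms.
After all, for most nonconvex problems it is difficult to derive meaningful theoretical
guarantees, but intuition and insight often carry over. Let's look at the problem of minimizing
$f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{c}^\top \mathbf{x} + b$.


As we saw in Section 12.6, it is possible to rewrite this problem in terms of its
eigendecomposition $\mathbf{Q} = \mathbf{U}^\top \boldsymbol{\Lambda} \mathbf{U}$ to arrive at a
much simplified problem where each coordinate can be solved individually:

$$f(\mathbf{x}) = \bar{f}(\bar{\mathbf{x}}) = \frac{1}{2} \bar{\mathbf{x}}^\top \boldsymbol{\Lambda} \bar{\mathbf{x}} + \bar{\mathbf{c}}^\top \bar{\mathbf{x}} + b. \tag{12.7.1}$$

Here we used $\bar{\mathbf{x}} = \mathbf{U} \mathbf{x}$ and consequently
$\bar{\mathbf{c}} = \mathbf{U} \mathbf{c}$. The modified problem has as its minimizer
$\bar{\mathbf{x}} = -\boldsymbol{\Lambda}^{-1} \bar{\mathbf{c}}$ and minimum value
$-\frac{1}{2} \bar{\mathbf{c}}^\top \boldsymbol{\Lambda}^{-1} \bar{\mathbf{c}} + b$. This is much
easier to compute since $\boldsymbol{\Lambda}$ is a diagonal matrix containing the eigenvalues
of $\mathbf{Q}$.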


If we perturb $\mathbf{c}$ slightly we would hope to find only slight changes in the minimizer
of $f$. Unfortunately this is not the case. While slight changes in $\mathbf{c}$ lead to equally
slight changes in $\bar{\mathbf{c}}$, this is not the case for the minimizer of $f$ (and of
$\bar{f}$ respectively). Whenever the eigenvalues $\boldsymbol{\Lambda}_i$ are large we will see
only small changes in $\bar{x}_i$ and in the minimum of $\bar{f}$. Conversely, for small
$\boldsymbol{\Lambda}_i$ changes in $\bar{x}_i$ can be dramatic. The ratio between the largest
and the smallest eigenvalue is called the condition number of an optimization problem.

$$\kappa = \frac{\boldsymbol{\Lambda}_1}{\boldsymbol{\Lambda}_d}. \tag{12.7.2}$$

If the condition number $\kappa$ is large, it is difficult to solve the optimization problem
accurately. We need to ensure that we are careful in getting a large dynamic range of values
right. Our analysis leads to an obvious, albeit somewhat naive question: couldn't we simply
"fix" the problem by distorting the space such that all eigenvalues are 1? In theory this is
quite easy: we only need the eigenvalues and eigenvectors of $\mathbf{Q}$ to rescale the problem
from $\mathbf{x}$ to one in
$\mathbf{z} \stackrel{\textrm{def}}{=} \boldsymbol{\Lambda}^{\frac{1}{2}} \mathbf{U} \mathbf{x}$.
In the new coordinate system $\mathbf{x}^\top \mathbf{Q} \mathbf{x}$ could be simplified to
$\|\mathbf{z}\|^2$. Alas, this is a rather impractical suggestion. Computing eigenvalues and
eigenvectors is in general much more expensive than solving the actual problem.


While computing eigenvalues exactly might be expensive, guessing them and computing
them even somewhat approximately may already be a lot better than not doing anything at
all. In particular, we could use the diagonal entries of $\mathbf{Q}$ and rescale it accordingly.
This is much cheaper than computing eigenvalues.

$$\tilde{\mathbf{Q}} = \mathrm{diag}^{-\frac{1}{2}}(\mathbf{Q}) \mathbf{Q} \mathrm{diag}^{-\frac{1}{2}}(\mathbf{Q}). \tag{12.7.3}$$

In this case we have
$\tilde{\mathbf{Q}}_{ij} = \mathbf{Q}_{ij} / \sqrt{\mathbf{Q}_{ii} \mathbf{Q}_{jj}}$ and
specifically $\tilde{\mathbf{Q}}_{ii} = 1$ for all $i$. In most cases this simplifies the
condition number considerably. For instance, in the cases we discussed previously, this would
entirely eliminate the problem at hand since the problem is axis aligned.
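
A small numerical illustration of diagonal preconditioning follows, restricted to symmetric $2 \times 2$ matrices so that the eigenvalues have a closed form. The axis-aligned matrix corresponds to $f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$. The second matrix, with off-diagonal entries of 0.3, is a made-up example, and both helper functions are hypothetical.

```python
import math

def condition_number_2x2(Q):
    """Ratio of largest to smallest eigenvalue of a symmetric positive definite 2x2 matrix."""
    (a, b), (_, d) = Q
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return (mean + radius) / (mean - radius)

def diagonal_precondition(Q):
    """Rescale Q so that entry (i, j) becomes Q_ij / sqrt(Q_ii * Q_jj)."""
    n = len(Q)
    return [[Q[i][j] / math.sqrt(Q[i][i] * Q[j][j]) for j in range(n)]
            for i in range(n)]

Q_axis = [[0.2, 0.0], [0.0, 4.0]]    # Hessian of 0.1 * x1**2 + 2 * x2**2
Q_mixed = [[0.2, 0.3], [0.3, 4.0]]   # made-up example with coupling
for name, Q in (('axis-aligned', Q_axis), ('mixed', Q_mixed)):
    before = condition_number_2x2(Q)
    after = condition_number_2x2(diagonal_precondition(Q))
    print(f'{name}: condition number {before:.3f} -> {after:.3f}')
```

For the axis-aligned problem the rescaling reduces the condition number from 20 to 1. With coupling between the coordinates it still helps considerably but no longer removes the ill-conditioning entirely.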


Unfortunately we face yet another problem: in deep learning we typically do not even have
access to the second derivative of the objective function: for $\mathbf{x} \in \mathbb{R}^d$
the second derivative even on a minibatch may require $\mathcal{O}(d^2)$ space and work to
compute, thus making it practically infeasible. The ingenious idea of Adagrad is to use a proxy
for that elusive diagonal of the Hessian that is both relatively cheap to compute and
effective: the magnitude of the gradient itself.


In order to see why this works, let's look at $\bar{f}(\bar{\mathbf{x}})$. We have that

$$\partial_{\bar{\mathbf{x}}} \bar{f}(\bar{\mathbf{x}}) = \boldsymbol{\Lambda} \bar{\mathbf{x}} + \bar{\mathbf{c}} = \boldsymbol{\Lambda} \left(\bar{\mathbf{x}} - \bar{\mathbf{x}}_0\right), \tag{12.7.4}$$

where $\bar{\mathbf{x}}_0$ is the minimizer of $\bar{f}$. Hence the magnitude of the gradient
depends both on $\boldsymbol{\Lambda}$ and the distance from optimality. If
$\bar{\mathbf{x}} - \bar{\mathbf{x}}_0$ did not change, this would be all that is needed.
After all, in this case the magnitude of the gradient
$\partial_{\bar{\mathbf{x}}} \bar{f}(\bar{\mathbf{x}})$ suffices. Since Adagrad is a
stochastic gradient descent algorithm, we will see gradients with nonzero variance even at
optimality. As a result we can safely use the variance of the gradients as a cheap proxy for
the scale of the Hessian. A thorough analysis is beyond the scope of this section (it would
be several pages). We refer the reader to Duchi et al. (2011) for details.


12.7.3 The Algorithm 


Let's formalize the discussion from above. We use the variable $\mathbf{s}_t$ to accumulate
past gradient variance as follows.

$$\begin{aligned} \mathbf{g}_t & = \partial_{\mathbf{w}} l(y_t, f(\mathbf{x}_t, \mathbf{w})), \\ \mathbf{s}_t & = \mathbf{s}_{t-1} + \mathbf{g}_t^2, \\ \mathbf{w}_t & = \mathbf{w}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \cdot \mathbf{g}_t. \end{aligned} \tag{12.7.5}$$

Here the operations are applied coordinate-wise. That is, $\mathbf{v}^2$ has entries $v_i^2$.
Likewise $\frac{1}{\sqrt{\mathbf{v}}}$ has entries $\frac{1}{\sqrt{v_i}}$ and
$\mathbf{u} \cdot \mathbf{v}$ has entries $u_i v_i$. As before $\eta$ is the learning rate and
$\epsilon$ is an additive constant that ensures that we do not divide by 0. Last, we initialize
$\mathbf{s}_0 = \mathbf{0}$.

Just like in the case of momentum we need to keep track of an auxiliary variable, in this
case to allow for an individual learning rate per coordinate. This does not increase the cost
of Adagrad significantly relative to SGD, simply since the main cost is typically to compute
$l(y_t, f(\mathbf{x}_t, \mathbf{w}))$ and its derivative.


Note that accumulating squared gradients in $\mathbf{s}_t$ means that $\mathbf{s}_t$ grows
essentially at linear rate (somewhat slower than linearly in practice, since the gradients
initially diminish). This leads to an $\mathcal{O}(t^{-\frac{1}{2}})$ learning rate, albeit
adjusted on a per coordinate basis. For convex problems this is perfectly adequate. In deep
learning, though, we might want to decrease the learning rate rather more slowly. This led to a
number of Adagrad variants that we will discuss in the subsequent chapters. For now let's see
how it behaves in a quadratic convex problem. We use the same problem as before:

$$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2. \tag{12.7.6}$$

We are going to implement Adagrad using the same learning rate as previously, i.e.,
$\eta = 0.4$. As we can see, the iterative trajectory of the independent variable is smoother.
However, due to the cumulative effect of $\mathbf{s}_t$, the learning rate continuously decays,
so the independent variable does not move as much during later stages of iteration.


%matplotlib inline
import math
import torch
from d2l import torch as d2l


def adagrad_2d(x1, x2, s1, s2):
    eps = 1e-6
    g1, g2 = 0.2 * x1, 4 * x2
    s1 += g1 ** 2
    s2 += g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2


def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2


eta = 0.4
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))


epoch 20, x1: -2.382563, x2: -0.158591 


As we increase the learning rate to 2 we see much better behavior. This already indicates that the decrease in learning rate might be rather aggressive, even in the noise-free case, and that we need to ensure that parameters converge appropriately.

eta = 2
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))

epoch 20, x1: -0.002295, x2: -0.000000


12.7.4 Implementation from Scratch 


Just like the momentum method, Adagrad needs to maintain a state variable of the same 
shape as the parameters. 


def init_adagrad_states(feature_dim):
    s_w = torch.zeros((feature_dim, 1))
    s_b = torch.zeros(1)
    return (s_w, s_b)

def adagrad(params, states, hyperparams):
    eps = 1e-6
    for p, s in zip(params, states):
        with torch.no_grad():
            # Accumulate squared gradients, then scale the step per coordinate
            s[:] += torch.square(p.grad)
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
        p.grad.data.zero_()


Compared to the experiment in Section 12.5 we use a larger learning rate to train the model.

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adagrad, init_adagrad_states(feature_dim),
               {'lr': 0.1}, data_iter, feature_dim);

loss: 0.243, 0.162 sec/epoch


12.7.5 Concise Implementation


Using the optimizer class torch.optim.Adagrad, we can invoke the Adagrad algorithm in PyTorch.


trainer = torch.optim.Adagrad 
d2l.train_concise_ch11(trainer, {'lr': 0.1}, data_iter) 


loss: 0.242, 0.129 sec/epoch


12.7.6 Summary 


• Adagrad decreases the learning rate dynamically on a per-coordinate basis.

• It uses the magnitude of the gradient as a means of adjusting how quickly progress is achieved: coordinates with large gradients are compensated with a smaller learning rate.

• Computing the exact second derivative is typically infeasible in deep learning problems due to memory and computational constraints. The gradient can be a useful proxy.

• If the optimization problem has a rather uneven structure, Adagrad can help mitigate the distortion.

• Adagrad is particularly effective for sparse features where the learning rate needs to decrease more slowly for infrequently occurring terms.

• On deep learning problems Adagrad can sometimes be too aggressive in reducing learning rates. We will discuss strategies for mitigating this in the context of Section 12.10.


12.7.7 Exercises 


1. Prove that for an orthogonal matrix $\mathbf{U}$ and a vector $\mathbf{c}$ the following holds: $\|\mathbf{c} - \boldsymbol{\delta}\|_2 = \|\mathbf{U}\mathbf{c} - \mathbf{U}\boldsymbol{\delta}\|_2$. Why does this mean that the magnitude of perturbations does not change after an orthogonal change of variables?

2. Try out Adagrad for $f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$ and also for the objective function rotated by 45 degrees, i.e., $f(\mathbf{x}) = 0.1 (x_1 + x_2)^2 + 2 (x_1 - x_2)^2$. Does it behave differently?

3. Prove Gerschgorin’s circle theorem, which states that eigenvalues $\lambda_i$ of a matrix $\mathbf{M}$ satisfy $|\lambda_i - \mathbf{M}_{jj}| \leq \sum_{k \neq j} |\mathbf{M}_{jk}|$ for at least one choice of $j$.

4. What does Gerschgorin’s theorem tell us about the eigenvalues of the diagonally preconditioned matrix $\textrm{diag}^{-\frac{1}{2}}(\mathbf{M}) \mathbf{M} \textrm{diag}^{-\frac{1}{2}}(\mathbf{M})$?


5. Try out Adagrad for a proper deep network, such as Section 7.6 when applied to Fashion- 
MNIST. 


6. How would you need to modify Adagrad to achieve a less aggressive decay in learning 
rate? 


Discussions 178.


12.8 RMSProp 


One of the key issues in Section 12.7 is that the learning rate decreases at a predefined schedule of effectively $O(t^{-\frac{1}{2}})$. While this is generally appropriate for convex problems, it might not be ideal for nonconvex ones, such as those encountered in deep learning. Yet, the coordinate-wise adaptivity of Adagrad is highly desirable as a preconditioner.

Tieleman and Hinton (2012) proposed the RMSProp algorithm as a simple fix to decouple rate scheduling from coordinate-adaptive learning rates. The issue is that Adagrad accumulates the squares of the gradient $\mathbf{g}_t$ into a state vector $\mathbf{s}_t = \mathbf{s}_{t-1} + \mathbf{g}_t^2$. As a result $\mathbf{s}_t$ keeps on growing without bound due to the lack of normalization, essentially linearly as the algorithm converges.

One way of fixing this problem would be to use $\mathbf{s}_t / t$. For reasonable distributions of $\mathbf{g}_t$ this will converge. Unfortunately it might take a very long time until the limit behavior starts to matter since the procedure remembers the full trajectory of values. An alternative is to use a leaky average in the same way we used in the momentum method, i.e., $\mathbf{s}_t \leftarrow \gamma \mathbf{s}_{t-1} + (1 - \gamma) \mathbf{g}_t^2$ for some parameter $\gamma > 0$. Keeping all other parts unchanged yields RMSProp.
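The difference between the two state updates can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the gradient is a hypothetical constant scalar, and the helper names `accumulate` and `leaky_average` are not part of any library.

```python
def accumulate(grads):
    """Adagrad-style state: s_t = s_{t-1} + g_t^2."""
    s, history = 0.0, []
    for g in grads:
        s += g ** 2
        history.append(s)
    return history

def leaky_average(grads, gamma=0.9):
    """RMSProp-style state: s_t = gamma * s_{t-1} + (1 - gamma) * g_t^2."""
    s, history = 0.0, []
    for g in grads:
        s = gamma * s + (1 - gamma) * g ** 2
        history.append(s)
    return history

grads = [2.0] * 200  # hypothetical constant gradient of magnitude 2
acc = accumulate(grads)
leaky = leaky_average(grads)
print(f'accumulated: {acc[-1]:.1f}, leaky average: {leaky[-1]:.4f}')
```

With a constant gradient the accumulated state grows linearly in the number of steps, whereas the leaky average settles near the squared gradient magnitude, so the scaling applied to the step no longer shrinks over time.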


12.8.1 The Algorithm 


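Let’s write out the equations in detail.

$$\begin{aligned}
\mathbf{s}_t &\leftarrow \gamma \mathbf{s}_{t-1} + (1 - \gamma) \mathbf{g}_t^2, \\
\mathbf{x}_t &\leftarrow \mathbf{x}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t.
\end{aligned} \qquad (12.8.1)$$

The constant $\epsilon > 0$ is typically set to $10^{-6}$ to ensure that we do not suffer from division by zero or overly large step sizes. Given this expansion we are now free to control the learning rate $\eta$ independently of the scaling that is applied on a per-coordinate basis. In terms of leaky averages we can apply the same reasoning as previously applied in the case of the momentum method. Expanding the definition of $\mathbf{s}_t$ yields

$$\begin{aligned}
\mathbf{s}_t &= (1 - \gamma) \mathbf{g}_t^2 + \gamma \mathbf{s}_{t-1} \\
&= (1 - \gamma) \left(\mathbf{g}_t^2 + \gamma \mathbf{g}_{t-1}^2 + \gamma^2 \mathbf{g}_{t-2}^2 + \ldots \right).
\end{aligned} \qquad (12.8.2)$$

As before in Section 12.6 we use $1 + \gamma + \gamma^2 + \ldots = \frac{1}{1-\gamma}$. Hence the sum of weights is normalized to 1 with a half-life time of an observation of $\gamma^{-1}$. Let’s visualize the weights for the past 40 time steps for various choices of $\gamma$.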


import math
import torch
from d2l import torch as d2l

d2l.set_figsize()
gammas = [0.95, 0.9, 0.8, 0.7]
for gamma in gammas:
    x = torch.arange(40).detach().numpy()
    # Weight (1 - gamma) * gamma^x given to the observation x steps in the past
    d2l.plt.plot(x, (1-gamma) * gamma ** x, label=f'gamma = {gamma:.2f}')
d2l.plt.xlabel('time');
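As a quick numerical check of the normalization claim, the following sketch (plain Python; the helper name `leaky_weights` is an illustrative assumption) sums the weights over the same 40-step window. Since the weights form a geometric series, the truncated sum should equal $1 - \gamma^{40}$.

```python
def leaky_weights(gamma, num_steps):
    """Weights (1 - gamma) * gamma**i given to the i-th most recent observation."""
    return [(1 - gamma) * gamma ** i for i in range(num_steps)]

for gamma in [0.95, 0.9, 0.8, 0.7]:
    total = sum(leaky_weights(gamma, 40))
    # Geometric series: the truncated sum equals 1 - gamma**40
    print(f'gamma = {gamma:.2f}: weight on the last 40 steps = {total:.4f}')
```

For larger $\gamma$ a visible share of the total weight lies outside the 40-step window, which matches the slower decay of the corresponding curve in the plot.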


12.8.2 Implementation from Scratch 


As before we use the quadratic function $f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$ to observe the trajectory of RMSProp. Recall that in Section 12.7, when we used Adagrad with a learning rate of 0.4, the variables moved only very slowly in the later stages of the algorithm since the learning rate decreased too quickly. Since $\eta$ is controlled separately this does not happen with RMSProp.

def rmsprop_2d(x1, x2, s1, s2):
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

eta, gamma = 0.4, 0.9
d2l.show_trace_2d(f_2d, d2l.train_2d(rmsprop_2d))


epoch 20, x1: -0.010599, x2: 0.000000


Next, we implement RMSProp to be used in a deep network. This is equally straightforward.

def init_rmsprop_states(feature_dim):
    s_w = torch.zeros((feature_dim, 1))
    s_b = torch.zeros(1)
    return (s_w, s_b)

def rmsprop(params, states, hyperparams):
    gamma, eps = hyperparams['gamma'], 1e-6
    for p, s in zip(params, states):
        with torch.no_grad():
            # Leaky average of squared gradients instead of a running sum
            s[:] = gamma * s + (1 - gamma) * torch.square(p.grad)
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
        p.grad.data.zero_()


We set the initial learning rate to 0.01 and the weighting term $\gamma$ to 0.9. That is, $\mathbf{s}$ aggregates on average over the past $1/(1-\gamma) = 10$ observations of the squared gradient.

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(rmsprop, init_rmsprop_states(feature_dim),
               {'lr': 0.01, 'gamma': 0.9}, data_iter, feature_dim);

loss: 0.245, 0.245 sec/epoch

12.8.3 Concise Implementation


Since RMSProp is a rather popular algorithm it is also available in torch.optim as RMSprop. All we need to do is pass it to the training function, assigning $\gamma$ to the parameter alpha.

trainer = torch.optim.RMSprop
d2l.train_concise_ch11(trainer, {'lr': 0.01, 'alpha': 0.9},
                       data_iter)


loss: 0.246, 0.129 sec/epoch 


12.8.4 Summary 


• RMSProp is very similar to Adagrad insofar as both use the square of the gradient to scale coefficients.

• RMSProp shares with momentum the leaky averaging. However, RMSProp uses the technique to adjust the coefficient-wise preconditioner.

• The learning rate needs to be scheduled by the experimenter in practice.

• The coefficient $\gamma$ determines how long the history is when adjusting the per-coordinate scale.


12.8.5 Exercises 
1. What happens experimentally if we set $\gamma = 1$? Why?

2. Rotate the optimization problem to minimize $f(\mathbf{x}) = 0.1 (x_1 + x_2)^2 + 2 (x_1 - x_2)^2$. What happens to the convergence?

3. Try out what happens to RMSProp on a real machine learning problem, such as training on Fashion-MNIST. Experiment with different choices for adjusting the learning rate.

4. Would you want to adjust $\gamma$ as optimization progresses? How sensitive is RMSProp to this?


Discussions 179.


12.9 Adadelta 


Adadelta is yet another variant of Adagrad (Section 12.7). The main difference lies in the fact that it decreases the amount by which the learning rate is adaptive to coordinates. Moreover, it is traditionally referred to as not having a learning rate since it uses the amount of change itself as calibration for future change. The algorithm was proposed in Zeiler (2012). It is fairly straightforward, given the discussion of previous algorithms so far.


12.9.1 The Algorithm 


In a nutshell, Adadelta uses two state variables, $\mathbf{s}_t$ to store a leaky average of the second moment of the gradient and $\Delta\mathbf{x}_t$ to store a leaky average of the second moment of the change of parameters in the model itself. Note that we use the original notation and naming of the authors for compatibility with other publications and implementations (there is no other real reason why one should use different Greek variables to indicate a parameter serving the same purpose in momentum, Adagrad, RMSProp, and Adadelta).


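Here are the technical details of Adadelta. Given the parameter du jour is $\rho$, we obtain the following leaky updates similarly to Section 12.8:

$$\mathbf{s}_t = \rho \mathbf{s}_{t-1} + (1 - \rho) \mathbf{g}_t^2. \qquad (12.9.1)$$

The difference to Section 12.8 is that we perform updates with the rescaled gradient $\mathbf{g}_t'$, i.e.,

$$\mathbf{x}_t = \mathbf{x}_{t-1} - \mathbf{g}_t'. \qquad (12.9.2)$$

So what is the rescaled gradient $\mathbf{g}_t'$? We can calculate it as follows:

$$\mathbf{g}_t' = \frac{\sqrt{\Delta\mathbf{x}_{t-1} + \epsilon}}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t, \qquad (12.9.3)$$

where $\Delta\mathbf{x}_{t-1}$ is the leaky average of the squared rescaled gradients $\mathbf{g}_t'$. We initialize $\Delta\mathbf{x}_0$ to be 0 and update it at each step with $\mathbf{g}_t'$, i.e.,

$$\Delta\mathbf{x}_t = \rho \Delta\mathbf{x}_{t-1} + (1 - \rho) {\mathbf{g}_t'}^2, \qquad (12.9.4)$$

and $\epsilon$ (a small value such as $10^{-5}$) is added to maintain numerical stability.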
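The four equations can be traced on a scalar problem with a few lines of plain Python. This is a minimal sketch rather than a tuned implementation: the function name `adadelta_step` and the toy objective $f(x) = x^2$ are assumptions made for illustration.

```python
import math

def adadelta_step(x, s, delta, grad, rho=0.9, eps=1e-5):
    """One Adadelta update for a scalar parameter, following (12.9.1)-(12.9.4)."""
    s = rho * s + (1 - rho) * grad ** 2                            # (12.9.1)
    g_prime = math.sqrt(delta + eps) / math.sqrt(s + eps) * grad   # (12.9.3)
    x = x - g_prime                                                # (12.9.2)
    delta = rho * delta + (1 - rho) * g_prime ** 2                 # (12.9.4)
    return x, s, delta

# Hypothetical toy objective f(x) = x^2 with gradient 2x, started at x = 1
x, s, delta = 1.0, 0.0, 0.0
for _ in range(500):
    x, s, delta = adadelta_step(x, s, delta, grad=2 * x)
print(f'x after 500 steps: {x:.4f}')
```

Note that no learning rate appears anywhere. Because $\Delta\mathbf{x}_0 = 0$, the size of the very first step is governed by $\epsilon$ alone, which is why the early iterations move slowly.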


12.9.2 Implementation 


Adadelta needs to maintain two state variables for each variable, $\mathbf{s}_t$ and $\Delta\mathbf{x}_t$. This yields the following implementation.

%matplotlib inline
import torch
from d2l import torch as d2l

def init_adadelta_states(feature_dim):
    s_w, s_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    delta_w, delta_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    return ((s_w, delta_w), (s_b, delta_b))

def adadelta(params, states, hyperparams):
    rho, eps = hyperparams['rho'], 1e-5
    for p, (s, delta) in zip(params, states):
        with torch.no_grad():
            # In-place updates via [:]
            s[:] = rho * s + (1 - rho) * torch.square(p.grad)
            g = (torch.sqrt(delta + eps) / torch.sqrt(s + eps)) * p.grad
            p[:] -= g
            delta[:] = rho * delta + (1 - rho) * g * g
        p.grad.data.zero_()


Choosing $\rho = 0.9$ amounts to a half-life time of 10 for each parameter update. This tends to work quite well. We get the following behavior.

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adadelta, init_adadelta_states(feature_dim),
               {'rho': 0.9}, data_iter, feature_dim);

loss: 0.245, 0.160 sec/epoch

For a concise implementation we simply use the Adadelta algorithm from high-level APIs. This yields the following one-liner for a much more compact invocation.

trainer = torch.optim.Adadelta
d2l.train_concise_ch11(trainer, {'rho': 0.9}, data_iter)

loss: 0.243, 0.119 sec/epoch


12.9.3 Summary 


• Adadelta has no learning rate parameter. Instead, it uses the rate of change in the parameters itself to adapt the learning rate.

• Adadelta requires two state variables to store the second moments of gradient and the change in parameters.

• Adadelta uses leaky averages to keep a running estimate of the appropriate statistics.

12.9.4 Exercises

1. Adjust the value of $\rho$. What happens?

2. Show how to implement the algorithm without the use of $\mathbf{g}_t'$. Why might this be a good idea?

3. Is Adadelta really learning rate free? Could you find optimization problems that break Adadelta?

4. Compare Adadelta to Adagrad and RMSProp to discuss their convergence behavior.


Discussions 180.


12.10 Adam 


In the discussions leading up to this section we encountered a number of techniques for 
efficient optimization. Let’s recap them in detail here: 


• We saw that stochastic gradient descent (Section 12.4) is more effective than gradient descent when solving optimization problems, e.g., due to its inherent resilience to redundant data.

• We saw that minibatch stochastic gradient descent (Section 12.5) affords significant additional efficiency arising from vectorization, using larger sets of observations in one minibatch. This is the key to efficient multi-machine, multi-GPU and overall parallel processing.

• The momentum method (Section 12.6) added a mechanism for aggregating a history of past gradients to accelerate convergence.

• Adagrad (Section 12.7) used per-coordinate scaling to allow for a computationally efficient preconditioner.

• RMSProp (Section 12.8) decoupled per-coordinate scaling from a learning rate adjustment.


Adam (Kingma and Ba, 2014) combines all these techniques into one efficient learning algorithm. As expected, this is an algorithm that has become rather popular as one of the more robust and effective optimization algorithms to use in deep learning. It is not without issues, though. In particular, Reddi et al. (2019) show that there are situations where Adam can diverge due to poor variance control. In a follow-up work Zaheer et al. (2018) proposed a hotfix to Adam, called Yogi, which addresses these issues. More on this later. For now let’s review the Adam algorithm.


12.10.1 The Algorithm 


One of the key components of Adam is that it uses exponential weighted moving averages (also known as leaky averaging) to obtain an estimate of both the momentum and also the second moment of the gradient. That is, it uses the state variables

$$\begin{aligned}
\mathbf{v}_t &\leftarrow \beta_1 \mathbf{v}_{t-1} + (1 - \beta_1) \mathbf{g}_t, \\
\mathbf{s}_t &\leftarrow \beta_2 \mathbf{s}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2.
\end{aligned} \qquad (12.10.1)$$

Here $\beta_1$ and $\beta_2$ are nonnegative weighting parameters. Common choices for them are $\beta_1 = 0.9$ and $\beta_2 = 0.999$. That is, the variance estimate moves much more slowly than the momentum term. Note that if we initialize $\mathbf{v}_0 = \mathbf{s}_0 = 0$ we have a significant amount of bias initially towards smaller values. This can be addressed by using the fact that $\sum_{i=0}^{t-1} \beta^i = \frac{1 - \beta^t}{1 - \beta}$ to re-normalize terms. Correspondingly the normalized state variables are given by

$$\hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_1^t} \textrm{ and } \hat{\mathbf{s}}_t = \frac{\mathbf{s}_t}{1 - \beta_2^t}. \qquad (12.10.2)$$

Armed with the proper estimates we can now write out the update equations. First, we rescale the gradient in a manner very much akin to that of RMSProp to obtain

$$\mathbf{g}_t' = \frac{\eta \hat{\mathbf{v}}_t}{\sqrt{\hat{\mathbf{s}}_t} + \epsilon}. \qquad (12.10.3)$$

Unlike RMSProp our update uses the momentum $\hat{\mathbf{v}}_t$ rather than the gradient itself. Moreover, there is a slight cosmetic difference as the rescaling happens using $\frac{1}{\sqrt{\hat{\mathbf{s}}_t} + \epsilon}$ instead of $\frac{1}{\sqrt{\hat{\mathbf{s}}_t + \epsilon}}$. The former works arguably slightly better in practice, hence the deviation from RMSProp. Typically we pick $\epsilon = 10^{-6}$ for a good trade-off between numerical stability and fidelity.

Now we have all the pieces in place to compute updates. This is slightly anticlimactic and we have a simple update of the form

$$\mathbf{x}_t \leftarrow \mathbf{x}_{t-1} - \mathbf{g}_t'. \qquad (12.10.4)$$


Reviewing the design of Adam, its inspiration is clear. First, momentum and scale are clearly visible in the state variables. Their rather peculiar definition forces us to debias terms (this could be fixed by a slightly different initialization and update condition). Second, the combination of both terms is pretty straightforward, given RMSProp. Last, the explicit learning rate $\eta$ allows us to control the step length to address issues of convergence.
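The effect of the bias correction is easy to see on a scalar example. The sketch below is plain Python with a hypothetical constant gradient; the helper name `leaky_average_with_correction` is an assumption for illustration and is not part of Adam implementations in any library.

```python
def leaky_average_with_correction(grads, beta):
    """Return (raw, corrected) leaky averages after processing all of grads."""
    v = 0.0
    for g in grads:
        v = beta * v + (1 - beta) * g   # state update with v_0 = 0
    t = len(grads)
    return v, v / (1 - beta ** t)       # divide by 1 - beta^t to debias

grads = [1.0] * 10  # hypothetical constant gradient
for beta in (0.9, 0.999):
    raw, corrected = leaky_average_with_correction(grads, beta)
    print(f'beta = {beta}: raw = {raw:.4f}, corrected = {corrected:.4f}')
```

After ten steps the raw average with $\beta = 0.999$ is still close to zero even though every gradient equals 1, while the corrected estimate recovers the true value. This is why the correction matters most for the slowly moving second-moment estimate.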


12.10.2 Implementation 


Implementing Adam from scratch is not very daunting. For convenience we store the time step counter $t$ in the hyperparams dictionary. Beyond that all is straightforward.

%matplotlib inline
import torch
from d2l import torch as d2l

def init_adam_states(feature_dim):
    v_w, v_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    s_w, s_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    return ((v_w, s_w), (v_b, s_b))

def adam(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (v, s) in zip(params, states):
        with torch.no_grad():
            v[:] = beta1 * v + (1 - beta1) * p.grad
            s[:] = beta2 * s + (1 - beta2) * torch.square(p.grad)
            # Bias correction for the zero-initialized state variables
            v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
            s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
            p[:] -= hyperparams['lr'] * v_bias_corr / (torch.sqrt(s_bias_corr)
                                                       + eps)
        p.grad.data.zero_()
    hyperparams['t'] += 1


We are ready to use Adam to train the model. We use a learning rate of $\eta = 0.01$.

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adam, init_adam_states(feature_dim),
               {'lr': 0.01, 't': 1}, data_iter, feature_dim);

loss: 0.243, 0.193 sec/epoch

A more concise implementation is straightforward since Adam is one of the algorithms provided as part of the PyTorch torch.optim library. Hence we only need to pass configuration parameters.


trainer = torch.optim.Adam 
d2l.train_concise_ch11(trainer, {'lr': 0.01}, data_iter) 


loss: 0.243, 0.152 sec/epoch 


12.10.3 Yogi 


One of the problems of Adam is that it can fail to converge even in convex settings when the second moment estimate in $\mathbf{s}_t$ blows up. As a fix Zaheer et al. (2018) proposed a refined update (and initialization) for $\mathbf{s}_t$. To understand what’s going on, let’s rewrite the Adam update as follows:

$$\mathbf{s}_t \leftarrow \mathbf{s}_{t-1} + (1 - \beta_2) \left(\mathbf{g}_t^2 - \mathbf{s}_{t-1}\right). \qquad (12.10.5)$$

Whenever $\mathbf{g}_t^2$ has high variance or updates are sparse, $\mathbf{s}_t$ might forget past values too quickly. A possible fix for this is to replace $\mathbf{g}_t^2 - \mathbf{s}_{t-1}$ by $\mathbf{g}_t^2 \odot \mathop{\textrm{sgn}}(\mathbf{g}_t^2 - \mathbf{s}_{t-1})$. Now the magnitude of the update no longer depends on the amount of deviation. This yields the Yogi updates

$$\mathbf{s}_t \leftarrow \mathbf{s}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2 \odot \mathop{\textrm{sgn}}(\mathbf{g}_t^2 - \mathbf{s}_{t-1}). \qquad (12.10.6)$$

The authors furthermore advise initializing the momentum on a larger initial batch rather than just an initial pointwise estimate. We omit the details since they are not material to the discussion and since even without this convergence remains pretty good.


def yogi(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-3
    for p, (v, s) in zip(params, states):
        with torch.no_grad():
            v[:] = beta1 * v + (1 - beta1) * p.grad
            # Yogi: the step size of s depends on g^2, only its sign on g^2 - s
            s[:] = s + (1 - beta2) * torch.sign(
                torch.square(p.grad) - s) * torch.square(p.grad)
            v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
            s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
            p[:] -= hyperparams['lr'] * v_bias_corr / (torch.sqrt(s_bias_corr)
                                                       + eps)
        p.grad.data.zero_()
    hyperparams['t'] += 1

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(yogi, init_adam_states(feature_dim),
               {'lr': 0.01, 't': 1}, data_iter, feature_dim);


loss: 0.243, 0.165 sec/epoch
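The difference between the two second-moment updates shows up clearly when gradients are sparse. The following sketch uses scalar state and a hypothetical gradient stream consisting of one spike followed by zeros; the function names are illustrative assumptions, not library calls.

```python
def adam_second_moment(s, g, beta2=0.999):
    """Adam: s_t = s_{t-1} + (1 - beta2) * (g_t^2 - s_{t-1}), as in (12.10.5)."""
    return s + (1 - beta2) * (g ** 2 - s)

def sign(z):
    """Scalar sign function returning -1, 0 or 1."""
    return (z > 0) - (z < 0)

def yogi_second_moment(s, g, beta2=0.999):
    """Yogi: s_t = s_{t-1} + (1 - beta2) * g_t^2 * sgn(g_t^2 - s_{t-1}), as in (12.10.6)."""
    return s + (1 - beta2) * g ** 2 * sign(g ** 2 - s)

# Hypothetical sparse gradient stream: one spike followed by 1000 zero gradients
s_adam = s_yogi = 0.0
for g in [10.0] + [0.0] * 1000:
    s_adam = adam_second_moment(s_adam, g)
    s_yogi = yogi_second_moment(s_yogi, g)
print(f'Adam: {s_adam:.4f}, Yogi: {s_yogi:.4f}')
```

In this setting Adam's estimate decays by a factor of $\beta_2$ at every step without a gradient, so the spike is gradually forgotten. Yogi's estimate stays put because its update is proportional to $g_t^2$, which is zero on those steps.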


12.10.4 Summary 


• Adam combines features of many optimization algorithms into a fairly robust update rule.

• Created on the basis of RMSProp, Adam also uses EWMA on the minibatch stochastic gradient.

• Adam uses bias correction to adjust for a slow startup when estimating momentum and a second moment.

• For gradients with significant variance we may encounter issues with convergence. They can be amended by using larger minibatches or by switching to an improved estimate for $\mathbf{s}_t$. Yogi offers such an alternative.


12.10.5 Exercises 


1. Adjust the learning rate and observe and analyze the experimental results.

2. Can you rewrite the momentum and second moment updates such that they do not require bias correction?

3. Why do you need to reduce the learning rate $\eta$ as we converge?

4. Try to construct a case for which Adam diverges and Yogi converges.


Discussions 181.


12.11 Learning Rate Scheduling 


So far we primarily focused on optimization algorithms for how to update the weight vectors 
rather than on the rate at which they are being updated. Nonetheless, adjusting the learning 
rate is often just as important as the actual algorithm. There are a number of aspects to 
consider: 


• Most obviously the magnitude of the learning rate matters. If it is too large, optimization diverges; if it is too small, it takes too long to train or we end up with a suboptimal result. We saw previously that the condition number of the problem matters (see e.g., Section 12.6 for details). Intuitively it is the ratio of the amount of change in the least sensitive direction vs. the most sensitive one.

• Secondly, the rate of decay is just as important. If the learning rate remains large we may simply end up bouncing around the minimum and thus not reach optimality. Section 12.5 discussed this in some detail and we analyzed performance guarantees in Section 12.4. In short, we want the rate to decay, but probably more slowly than $O(t^{-\frac{1}{2}})$, which would be a good choice for convex problems.

• Another aspect that is equally important is initialization. This pertains both to how the parameters are set initially (review Section 5.4 for details) and also how they evolve initially. This goes under the moniker of warmup, i.e., how rapidly we start moving towards the solution initially. Large steps in the beginning might not be beneficial, in particular since the initial set of parameters is random. The initial update directions might be quite meaningless, too.

• Lastly, there are a number of optimization variants that perform cyclical learning rate adjustment. This is beyond the scope of the current chapter. We recommend the reader to review details in Izmailov et al. (2018), e.g., how to obtain better solutions by averaging over an entire path of parameters.


Given the fact that there is a lot of detail needed to manage learning rates, most deep learn- 
ing frameworks have tools to deal with this automatically. In the current chapter we will 
review the effects that different schedules have on accuracy and also show how this can be 
managed efficiently via a learning rate scheduler. 


12.11.1 Toy Problem 


We begin with a toy problem that is cheap enough to compute easily, yet sufficiently nontrivial to illustrate some of the key aspects. For that we pick a slightly modernized version of LeNet (relu instead of sigmoid activation, MaxPooling rather than AveragePooling), as applied to Fashion-MNIST. Since most of the code is standard we just introduce the basics without further detailed discussion. See Chapter 7 for a refresher as needed.


%matplotlib inline
import math
import torch
from torch import nn
from torch.optim import lr_scheduler
from d2l import torch as d2l

def net_fn():
    model = nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Flatten(),
        nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
        nn.Linear(120, 84), nn.ReLU(),
        nn.Linear(84, 10))

    return model

loss = nn.CrossEntropyLoss()
device = d2l.try_gpu()

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)

# The code is almost identical to `d2l.train_ch6` defined in the
# lenet section of chapter convolutional neural networks
def train(net, train_iter, test_iter, num_epochs, loss, trainer, device,
          scheduler=None):
    net.to(device)
    animator = d2l.Animator(xlabel='epoch', xlim=[0, num_epochs],
                            legend=['train loss', 'train acc', 'test acc'])

    for epoch in range(num_epochs):
        metric = d2l.Accumulator(3)  # train_loss, train_acc, num_examples
        for i, (X, y) in enumerate(train_iter):
            net.train()
            trainer.zero_grad()
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            l.backward()
            trainer.step()
            with torch.no_grad():
                metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
            train_loss = metric[0] / metric[2]
            train_acc = metric[1] / metric[2]
            if (i + 1) % 50 == 0:
                animator.add(epoch + i / len(train_iter),
                             (train_loss, train_acc, None))

        test_acc = d2l.evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch+1, (None, None, test_acc))

        if scheduler:
            if scheduler.__module__ == lr_scheduler.__name__:
                # Using PyTorch In-Built scheduler
                scheduler.step()
            else:
                # Using custom defined scheduler
                for param_group in trainer.param_groups:
                    param_group['lr'] = scheduler(epoch)

    print(f'train loss {train_loss:.3f}, train acc {train_acc:.3f}, '
          f'test acc {test_acc:.3f}')


Let’s have a look at what happens if we invoke this algorithm with default settings, such as a learning rate of 0.3, and train for 30 epochs. Note how the training accuracy keeps on increasing while progress in terms of test accuracy stalls beyond a point. The gap between both curves indicates overfitting.

lr, num_epochs = 0.3, 30
net = net_fn()
trainer = torch.optim.SGD(net.parameters(), lr=lr)
train(net, train_iter, test_iter, num_epochs, loss, trainer, device)


train loss 0.145, train acc 0.944, test acc 0.877 




12.11.2 Schedulers 


One way of adjusting the learning rate is to set it explicitly at each step. In PyTorch this is conveniently achieved by assigning a new value to the lr entry of the optimizer's param_groups. We could adjust it downward after every epoch (or even after every minibatch), e.g., in a dynamic manner in response to how optimization is progressing.

lr = 0.1
trainer.param_groups[0]["lr"] = lr
print(f'learning rate is now {trainer.param_groups[0]["lr"]:.2f}')


learning rate is now 0.10 


More generally we want to define a scheduler. When invoked with the number of updates it returns the appropriate value of the learning rate. Let’s define a simple one that sets the learning rate to $\eta = \eta_0 (t + 1)^{-\frac{1}{2}}$.

class SquareRootScheduler:
    def __init__(self, lr=0.1):
        self.lr = lr

    def __call__(self, num_update):
        return self.lr * pow(num_update + 1.0, -0.5)


Let’s plot its behavior over a range of values. 


scheduler = SquareRootScheduler(lr=0.1)
d2l.plot(torch.arange(num_epochs), [scheduler(t) for t in range(num_epochs)])

Now let’s see how this plays out for training on Fashion-MNIST. We simply provide the scheduler as an additional argument to the training algorithm.

net = net_fn()
trainer = torch.optim.SGD(net.parameters(), lr)
train(net, train_iter, test_iter, num_epochs, loss, trainer, device,
      scheduler)


train loss 0.273, train acc 0.900, test acc 0.886 


This worked quite a bit better than previously. Two things stand out. First, the curve was smoother than previously. Second, there was less overfitting. Unfortunately it is not a well-resolved question as to why certain strategies lead to less overfitting in theory. There is some argument that a smaller stepsize will lead to parameters that are closer to zero and thus simpler. However, this does not explain the phenomenon entirely since we do not really stop early but simply reduce the learning rate gently.


12.11.3 Policies 


While we cannot possibly cover the entire variety of learning rate schedulers, we attempt to give a brief overview of popular policies below. Common choices are polynomial decay and piecewise constant schedules. Beyond that, cosine learning rate schedules have been found to work well empirically on some problems. Lastly, on some problems it is beneficial to warm up the optimizer prior to using large learning rates.

Factor Scheduler

One alternative to a polynomial decay would be a multiplicative one, that is $\eta_{t+1} \leftarrow \eta_t \cdot \alpha$ for $\alpha \in (0, 1)$. To prevent the learning rate from decaying beyond a reasonable lower bound the update equation is often modified to $\eta_{t+1} \leftarrow \mathop{\mathrm{max}}(\eta_{\mathrm{min}}, \eta_t \cdot \alpha)$.


class FactorScheduler:
    def __init__(self, factor=1, stop_factor_lr=1e-7, base_lr=0.1):
        self.factor = factor
        self.stop_factor_lr = stop_factor_lr
        self.base_lr = base_lr

    def __call__(self, num_update):
        # Each call shrinks the stored rate by `factor`, bounded from below
        self.base_lr = max(self.stop_factor_lr, self.base_lr * self.factor)
        return self.base_lr

scheduler = FactorScheduler(factor=0.9, stop_factor_lr=1e-2, base_lr=2.0)
d2l.plot(torch.arange(50), [scheduler(t) for t in range(50)])

This can also be accomplished by a built-in scheduler in MXNet via the lr_scheduler.FactorScheduler object. It takes a few more parameters, such as warmup period, warmup mode (linear or constant), the maximum number of desired updates, etc. Going forward we will use the built-in schedulers as appropriate and only explain their functionality here. As illustrated, it is fairly straightforward to build your own scheduler if needed.
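Note that the FactorScheduler class above multiplies its stored rate on every call, so the value it returns depends on how often it has been called rather than on num_update. A stateless variant is sketched below as one possible alternative. The class name `StatelessFactorScheduler` is hypothetical, and the closed form $\max(\eta_{\mathrm{min}}, \eta_0 \alpha^t)$ is an assumption about the intended behavior rather than the book's implementation.

```python
class StatelessFactorScheduler:
    """Hypothetical variant: lr = max(stop_factor_lr, base_lr * factor**num_update)."""
    def __init__(self, factor=0.9, stop_factor_lr=1e-2, base_lr=2.0):
        self.factor = factor
        self.stop_factor_lr = stop_factor_lr
        self.base_lr = base_lr

    def __call__(self, num_update):
        # Depends only on num_update, so repeated calls return the same value
        return max(self.stop_factor_lr,
                   self.base_lr * self.factor ** num_update)

scheduler = StatelessFactorScheduler()
rates = [scheduler(t) for t in range(50)]
print(f'first: {rates[0]:.3f}, last: {rates[-1]:.4f}')
```

Because the result is a pure function of the update count, such a scheduler can be queried out of order (for plotting or when resuming training) without changing later values.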


Multi Factor Scheduler 


A common strategy for training deep networks is to keep the learning rate piecewise constant and to decrease it by a given amount every so often. That is, given a set of times when to decrease the rate, such as $s = \{5, 10, 20\}$, decrease $\eta_{t+1} \leftarrow \eta_t \cdot \alpha$ whenever $t \in s$. Assuming that the values are halved at each step we can implement this as follows.

net = net_fn()
trainer = torch.optim.SGD(net.parameters(), lr=0.5)
scheduler = lr_scheduler.MultiStepLR(trainer, milestones=[15, 30], gamma=0.5)

def get_lr(trainer, scheduler):
    lr = scheduler.get_last_lr()[0]
    trainer.step()
    scheduler.step()
    return lr

d2l.plot(torch.arange(num_epochs), [get_lr(trainer, scheduler)
                                    for t in range(num_epochs)])

The intuition behind this piecewise constant learning rate schedule is that one lets optimization proceed until a stationary point has been reached in terms of the distribution of weight vectors. Then (and only then) do we decrease the rate such as to obtain a higher quality proxy to a good local minimum. The example below shows how this can produce slightly better solutions.


train(net, train_iter, test_iter, num_epochs, loss, trainer, device, 
scheduler) 


train loss 0.194, train acc 0.927, test acc 0.869 


[Plot: train loss, train acc, and test acc curves.]


Cosine Scheduler


A rather perplexing heuristic was proposed by Loshchilov and Hutter (2016). It relies on 
the observation that we might not want to decrease the learning rate too drastically in the 
beginning and moreover, that we might want to “refine” the solution in the end using a very 
small learning rate. This results in a cosine-like schedule with the following functional 
form for learning rates in the range $t \in [0, T]$.


$$\eta_t = \eta_T + \frac{\eta_0 - \eta_T}{2} \left(1 + \cos(\pi t/T)\right) \quad (12.11.1)$$


Here $\eta_0$ is the initial learning rate, $\eta_T$ is the target rate at time $T$. Furthermore, for $t > T$
we simply pin the value to $\eta_T$ without increasing it again. In the following example, we set
the max update step $T = 20$.


class CosineScheduler:
    def __init__(self, max_update, base_lr=0.01, final_lr=0,
                 warmup_steps=0, warmup_begin_lr=0):
        self.base_lr_orig = base_lr
        self.max_update = max_update
        self.final_lr = final_lr
        self.warmup_steps = warmup_steps
        self.warmup_begin_lr = warmup_begin_lr
        self.max_steps = self.max_update - self.warmup_steps

    def get_warmup_lr(self, epoch):
        # Linear increase from warmup_begin_lr to base_lr_orig
        increase = (self.base_lr_orig - self.warmup_begin_lr) \
                       * float(epoch) / float(self.warmup_steps)
        return self.warmup_begin_lr + increase

    def __call__(self, epoch):
        if epoch < self.warmup_steps:
            return self.get_warmup_lr(epoch)
        if epoch <= self.max_update:
            self.base_lr = self.final_lr + (
                self.base_lr_orig - self.final_lr) * (1 + math.cos(
                math.pi * (epoch - self.warmup_steps) / self.max_steps)) / 2
        return self.base_lr


scheduler = CosineScheduler(max_update=20, base_lr=0.3, final_lr=0.01)
d2l.plot(torch.arange(num_epochs), [scheduler(t) for t in range(num_epochs)])
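
As a quick sanity check of (12.11.1), the closed form can be evaluated directly at the start, the midpoint, and the end of the schedule. The function `cosine_lr` below is a hypothetical stand-alone helper written for this check, using the same values as the example above (initial rate 0.3, target rate 0.01, T = 20).

```python
import math

def cosine_lr(t, T, eta_0, eta_T):
    """Evaluate eta_t = eta_T + (eta_0 - eta_T) / 2 * (1 + cos(pi * t / T))."""
    if t > T:
        return eta_T  # pin the rate once the schedule has finished
    return eta_T + (eta_0 - eta_T) / 2 * (1 + math.cos(math.pi * t / T))

start = cosine_lr(0, 20, 0.3, 0.01)    # cos(0) = 1, so this is eta_0
middle = cosine_lr(10, 20, 0.3, 0.01)  # cos(pi/2) = 0, so this is the average
end = cosine_lr(20, 20, 0.3, 0.01)     # cos(pi) = -1, so this is eta_T
print(start, middle, end)
```

Up to floating-point rounding, the three values are the initial rate, the average of the initial and target rates, and the target rate.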
In the context of computer vision this schedule can lead to improved results. Note, though,
that such improvements are not guaranteed (as can be seen below).


net = net_fn()
trainer = torch.optim.SGD(net.parameters(), lr=0.3)
train(net, train_iter, test_iter, num_epochs, loss, trainer, device,
      scheduler)
train loss 0.159, train acc 0.942, test acc 0.904


[Plot: train loss, train acc, and test acc curves.]


Warmup 


In some cases initializing the parameters is not sufficient to guarantee a good solution. This 
is particularly a problem for some advanced network designs that may lead to unstable 
optimization problems. We could address this by choosing a sufficiently small learning 
rate to prevent divergence in the beginning. Unfortunately this means that progress is slow. 
Conversely, a large learning rate initially leads to divergence. 


A rather simple fix for this dilemma is to use a warmup period during which the learning rate 
increases to its initial maximum and to cool down the rate until the end of the optimization 
process. For simplicity one typically uses a linear increase for this purpose. This leads to 
a schedule of the form indicated below. 


scheduler = CosineScheduler(20, warmup_steps=5, base_lr=0.3, final_lr=0.01)
d2l.plot(torch.arange(num_epochs), [scheduler(t) for t in range(num_epochs)])
Note that the network converges better initially (in particular observe the performance
during the first 5 epochs).


net = net_fn() 

trainer = torch.optim.SGD(net.parameters(), lr=0.3) 

train(net, train_iter, test_iter, num_epochs, loss, trainer, device, 
scheduler) 
train loss 0.181, train acc 0.934, test acc 0.901


[Plot: train loss, train acc, and test acc curves.]
Warmup can be applied to any scheduler (not just cosine). For a more detailed discussion
of learning rate schedules and many more experiments see also (Gotmare et al., 2018). In
particular they find that a warmup phase limits the amount of divergence of parameters
in very deep networks. This makes intuitive sense since we would expect significant
divergence due to random initialization in those parts of the network that take the most
time to make progress in the beginning.
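
To make the claim that warmup combines with any schedule concrete, here is a minimal sketch in plain Python. The names `WarmupWrapper` and `step_decay` are hypothetical helpers introduced for illustration, and shifting the clock of the wrapped schedule by the warmup length is one possible design choice among several.

```python
class WarmupWrapper:
    """Linear warmup followed by an arbitrary base schedule (illustrative helper)."""
    def __init__(self, base_schedule, warmup_steps, peak_lr, begin_lr=0.0):
        self.base_schedule = base_schedule
        self.warmup_steps = warmup_steps
        self.peak_lr = peak_lr
        self.begin_lr = begin_lr

    def __call__(self, t):
        if t < self.warmup_steps:
            # Linear interpolation from begin_lr to peak_lr
            frac = t / self.warmup_steps
            return self.begin_lr + (self.peak_lr - self.begin_lr) * frac
        # After warmup, hand over to the base schedule with a shifted clock
        return self.base_schedule(t - self.warmup_steps)


def step_decay(t, base_lr=0.3, factor=0.5, every=10):
    """Halve the rate every `every` steps."""
    return base_lr * factor ** (t // every)


scheduler = WarmupWrapper(step_decay, warmup_steps=5, peak_lr=0.3)
rates = [scheduler(t) for t in range(30)]
print(rates[0], rates[5], rates[15])
```

Here the rate rises linearly over the first 5 steps, reaches the peak of 0.3 at step 5, and is then halved every 10 steps of the wrapped schedule.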


12.11.4 Summary 


• Decreasing the learning rate during training can lead to improved accuracy and (most
  perplexingly) reduced overfitting of the model.


• A piecewise decrease of the learning rate whenever progress has plateaued is effective
  in practice. Essentially this ensures that we converge efficiently to a suitable solution
  and only then reduce the inherent variance of the parameters by reducing the learning
  rate.


• Cosine schedulers are popular for some computer vision problems. See, e.g., GluonCV
  for details of such a scheduler.


• A warmup period before optimization can prevent divergence.


• Optimization serves multiple purposes in deep learning. Besides minimizing the training
  objective, different choices of optimization algorithms and learning rate scheduling
  can lead to rather different amounts of generalization and overfitting on the test set
  (for the same amount of training error).


12.11.5 Exercises 


1. Experiment with the optimization behavior for a given fixed learning rate. What is the 
best model you can obtain this way? 


2. How does convergence change if you change the exponent of the decrease in the learning 
rate? Use PolyScheduler for your convenience in the experiments. 
3. Apply the cosine scheduler to large computer vision problems, e.g., training ImageNet.
How does it affect performance relative to other schedulers?


4. How long should warmup last?


5. Can you connect optimization and sampling? Start by using results from Welling and
Teh (2011) on Stochastic Gradient Langevin Dynamics.


In deep learning, datasets and models are usually large, which involves heavy computation.
Therefore, computational performance matters a lot. This chapter will focus on the
major factors that affect computational performance: imperative programming, symbolic
programming, asynchronous computing, automatic parallelism, and multi-GPU computation.
By studying this chapter, you may further improve computational performance of
those models implemented in the previous chapters, for example, by reducing training time
without affecting accuracy.


13.1 Compilers and Interpreters 


So far, this book has focused on imperative programming, which makes use of statements 
such as print, +, and if to change a program’s state. Consider the following example of a 
simple imperative program. 


def add(a, b):
    return a + b

def fancy_func(a, b, c, d):
    e = add(a, b)
    f = add(c, d)
    g = add(e, f)
    return g

print(fancy_func(1, 2, 3, 4))


10 


Python is an interpreted language. When evaluating the above fancy_func function it 
performs the operations making up the function’s body in sequence. That is, it will evaluate 
e = add(a, b) and store the results as variable e, thereby changing the program’s state. 
The next two statements f = add(c, d) and g = add(e, f) will be executed similarly,
performing additions and storing the results as variables. Fig. 13.1.1 illustrates the flow of 
data. 
Fig. 13.1.1: Data flow in an imperative program.


Although imperative programming is convenient, it may be inefficient. On the one hand, 
even if the add function is repeatedly called throughout fancy_func, Python will execute 
the three function calls individually. If these are executed, say, on a GPU (or even on mul- 
tiple GPUs), the overhead arising from the Python interpreter can become overwhelming. 
Moreover, it will need to save the variable values of e and f until all the statements in 
fancy_func have been executed. This is because we do not know whether the variables e 
and f will be used by other parts of the program after the statements e = add(a, b) and 
f = add(c, d) are executed. 


13.1.1 Symbolic Programming 


Consider the alternative, symbolic programming, where computation is usually performed 
only once the process has been fully defined. This strategy is used by multiple deep learning 
frameworks, including Theano and TensorFlow (the latter has acquired imperative exten- 
sions). It usually involves the following steps: 


1. Define the operations to be executed. 
2. Compile the operations into an executable program. 
3. Provide the required inputs and call the compiled program for execution. 


This allows for a significant amount of optimization. First, we can skip the Python inter- 
preter in many cases, thus removing a performance bottleneck that can become significant 
on multiple fast GPUs paired with a single Python thread on a CPU. Second, a compiler 
might optimize and rewrite the above code into print((1 + 2) + (3 + 4)) or even 
print(10). This is possible since a compiler gets to see the full code before turning it into 
machine instructions. For instance, it can release memory (or never allocate it) whenever a 
variable is no longer needed. Or it can transform the code entirely into an equivalent piece. 
To get a better idea, consider the following simulation of imperative programming (it is 
Python after all) below. 


def add_():
    return '''
def add(a, b):
    return a + b
'''

def fancy_func_():
    return '''
def fancy_func(a, b, c, d):
    e = add(a, b)
    f = add(c, d)
    g = add(e, f)
    return g
'''

def evoke_():
    return add_() + fancy_func_() + 'print(fancy_func(1, 2, 3, 4))'

prog = evoke_()
print(prog)
y = compile(prog, '', 'exec')
exec(y)


def add(a, b):
    return a + b

def fancy_func(a, b, c, d):
    e = add(a, b)
    f = add(c, d)
    g = add(e, f)
    return g
print(fancy_func(1, 2, 3, 4))


10 
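
The three steps (define, compile, execute) and the kind of rewriting a compiler can do become more tangible with a toy expression graph. The sketch below is an illustration only: the node classes `Const`, `Var`, and `Add` and the functions `fold` and `evaluate` are hypothetical, and real frameworks use far more elaborate graph representations and optimization passes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Const:
    value: float

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Add:
    left: object
    right: object

def fold(node):
    """Constant folding: replace Add(Const, Const) by a single Const."""
    if isinstance(node, Add):
        left, right = fold(node.left), fold(node.right)
        if isinstance(left, Const) and isinstance(right, Const):
            return Const(left.value + right.value)
        return Add(left, right)
    return node

def evaluate(node, env):
    """Execute the (possibly optimized) graph given values for the variables."""
    if isinstance(node, Const):
        return node.value
    if isinstance(node, Var):
        return env[node.name]
    return evaluate(node.left, env) + evaluate(node.right, env)

# Step 1: define the computation without running it
graph = Add(Add(Const(1), Const(2)), Add(Const(3), Var('d')))
# Step 2: "compile", here just a single constant-folding pass
optimized = fold(graph)
# Step 3: supply the inputs and execute
result = evaluate(optimized, {'d': 4})
print(optimized)
print(result)
```

Since the whole graph is visible before execution, the subexpression 1 + 2 is collapsed into a single constant ahead of time, while the part that depends on the input d is left for run time.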


The differences between imperative (interpreted) programming and symbolic programming 
are as follows: 


• Imperative programming is easier. When imperative programming is used in Python,
  the majority of the code is straightforward and easy to write. It is also easier to
  debug imperative programming code. This is because it is easier to obtain and print all
  relevant intermediate variable values, or use Python’s built-in debugging tools.


• Symbolic programming is more efficient and easier to port. Symbolic programming
  makes it easier to optimize the code during compilation, while also having the ability
  to port the program into a format independent of Python. This allows the program to
  be run in a non-Python environment, thus avoiding any potential performance issues
  related to the Python interpreter.


13.1.2 Hybrid Programming 


Historically most deep learning frameworks choose between an imperative or a symbolic 
approach. For example, Theano, TensorFlow (inspired by the former), Keras, and CNTK 
formulate models symbolically. Conversely, Chainer and PyTorch take an imperative ap- 
proach. An imperative mode was added to TensorFlow 2.0 and Keras in later revisions. 


As mentioned above, PyTorch is based on imperative programming and uses dynamic
computation graphs. In an effort to leverage the portability and efficiency of symbolic
programming, developers considered whether it would be possible to combine the benefits of both
programming paradigms. This led to TorchScript, which lets users develop and debug
using pure imperative programming, while having the ability to convert most programs into
symbolic programs to be run when product-level computing performance and deployment
are required.


13.1.3 Hybridizing the Sequential Class 


The easiest way to get a feel for how hybridization works is to consider deep networks with 
multiple layers. Conventionally the Python interpreter will need to execute the code for all 
layers to generate an instruction that can then be forwarded to a CPU or a GPU. For a single 
(fast) computing device this does not cause any major issues. On the other hand, if we use 
an advanced 8-GPU server such as an AWS P3dn.24xlarge instance Python will struggle to 
keep all GPUs busy. The single-threaded Python interpreter becomes the bottleneck here. 
Let’s see how we can address this for significant parts of the code by compiling a Sequential
model with torch.jit.script. We begin by defining a simple MLP.


import torch
from torch import nn
from d2l import torch as d2l


# Factory for networks
def get_net():
    net = nn.Sequential(nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 2))
    return net

x = torch.randn(size=(1, 512))
net = get_net()
net(x)


tensor([[-0.1602,  0.0003]], grad_fn=<AddmmBackward0>)


By converting the model using the torch.jit.script function, we are able to compile and
optimize the computation in the MLP. The model’s computation result remains unchanged.


net = torch.jit.script(net)
net(x)


tensor([[-0.1602,  0.0003]], grad_fn=<AddmmBackward0>)


This seems almost too good to be true: write the same code as before and simply convert
the model using torch.jit.script. Once this happens the network is optimized (we will
benchmark the performance below).
Acceleration by Hybridization


To demonstrate the performance improvement gained by compilation we compare the time
needed to evaluate net(x) before and after hybridization. Let’s define a class to measure
this time first. It will come in handy throughout the chapter as we set out to measure (and
improve) performance.


#@save
class Benchmark:
    """For measuring running time."""
    def __init__(self, description='Done'):
        self.description = description

    def __enter__(self):
        self.timer = d2l.Timer()
        return self

    def __exit__(self, *args):
        print(f'{self.description}: {self.timer.stop():.4f} sec')
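
For readers who want to try the timing pattern without the d2l package, the following is a minimal stand-alone sketch built on `time.perf_counter` from the standard library. The name `SimpleBenchmark` and the `elapsed` attribute are assumptions made for this illustration and are not part of d2l.

```python
import time

class SimpleBenchmark:
    """Stand-alone variant of a timing context manager (illustrative)."""
    def __init__(self, description='Done'):
        self.description = description
        self.elapsed = None

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *args):
        # Record the wall-clock time spent inside the `with` block
        self.elapsed = time.perf_counter() - self.start
        print(f'{self.description}: {self.elapsed:.4f} sec')


with SimpleBenchmark('Sum of squares') as bench:
    total = sum(i * i for i in range(100_000))
```

A timer of this kind only measures what has finished by the time the block exits, so asynchronous GPU work would need an explicit synchronization inside the block to be counted.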


Now we can invoke the network twice, once with and once without torchscript. 


net = get_net()
with Benchmark('Without torchscript'):
    for i in range(1000): net(x)

net = torch.jit.script(net)
with Benchmark('With torchscript'):
    for i in range(1000): net(x)


Without torchscript: 2.1447 sec 
With torchscript: 4.0545 sec 


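Scripting an nn.Sequential instance with the torch.jit.script function is intended to
improve computing performance through the use of symbolic programming. Note, however,
that in the run shown above the scripted model was slower than the original one, so any
gain depends on the model, the hardware, and the PyTorch version.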


Serialization 


One of the benefits of compiling the models is that we can serialize (save) the model and 
its parameters to disk. This allows us to store a model in a manner that is independent of 
the front-end language of choice. This allows us to deploy trained models to other devices 
and easily use other front-end programming languages. At the same time the code is often 
faster than what can be achieved in imperative programming. Let’s see the save function 
in action. 


net.save('my_mlp')
!ls -lh my_mlp*


-rw-r--r-- 1 ci ci 651K Aug 18 19:32 my_mlp


13.1.4 Summary 


• Imperative programming makes it easy to design new models since it is possible to write
  code with control flow and the ability to use a large amount of the Python software
  ecosystem.


• Symbolic programming requires that we specify the program and compile it before
  executing it. The benefit is improved performance.


13.1.5 Exercises 


1. Review the models that interest you in the previous chapters. Can you improve their 
computational performance by reimplementing them? 
13.2 Asynchronous Computation


Today’s computers are highly parallel systems, consisting of multiple CPU cores (often 
multiple threads per core), multiple processing elements per GPU, and often multiple GPUs 
per device. In short, we can process many different things at the same time, often on differ- 
ent devices. Unfortunately Python is not a great way of writing parallel and asynchronous 
code, at least not without some extra help. After all, Python is single-threaded and this is 
unlikely to change in the future. Deep learning frameworks such as MXNet and TensorFlow
adopt an asynchronous programming model to improve performance, while PyTorch
uses Python’s own scheduler leading to a different performance trade-off. For PyTorch, by
default, GPU operations are asynchronous. When you call a function that uses the GPU, the 
operations are enqueued to the particular device, but not necessarily executed until later. 
This allows us to execute more computations in parallel, including operations on the CPU 
or other GPUs. 


Hence, understanding how asynchronous programming works helps us to develop more 
efficient programs, by proactively reducing computational requirements and mutual de- 
pendencies. This allows us to reduce memory overhead and increase processor utiliza- 
tion. 


import os
import subprocess
import numpy
import torch
from torch import nn
from d2l import torch as d2l
13.2.1 Asynchrony via Backend


For a warmup consider the following toy problem: we want to generate a random matrix 
and multiply it. Let’s do that both in NumPy and in PyTorch tensor to see the difference. 
Note that PyTorch tensor is defined on a GPU. 


# Warmup for GPU computation
device = d2l.try_gpu()
a = torch.randn(size=(1000, 1000), device=device)
b = torch.mm(a, a)

with d2l.Benchmark('numpy'):
    for _ in range(10):
        a = numpy.random.normal(size=(1000, 1000))
        b = numpy.dot(a, a)

with d2l.Benchmark('torch'):
    for _ in range(10):
        a = torch.randn(size=(1000, 1000), device=device)
        b = torch.mm(a, a)


numpy: 1.4693 sec 
torch: 0.0022 sec 


The benchmark output via PyTorch is orders of magnitude faster. NumPy dot product is ex- 
ecuted on the CPU processor while PyTorch matrix multiplication is executed on GPU and 
hence the latter is expected to be much faster. But the huge time difference suggests some- 
thing else must be going on. By default, GPU operations are asynchronous in PyTorch. 
Forcing PyTorch to finish all computation prior to returning shows what happened previ- 
ously: computation is being executed by the backend while the frontend returns control to 
Python. 


with d2l.Benchmark():
    for _ in range(10):
        a = torch.randn(size=(1000, 1000), device=device)
        b = torch.mm(a, a)
    # Wait for all queued GPU work to finish before stopping the timer
    torch.cuda.synchronize(device)


Done: 0.0058 sec 


Broadly speaking, PyTorch has a frontend for direct interaction with the users, e.g., via 
Python, as well as a backend used by the system to perform the computation. As shown 
in Fig. 13.2.1, users can write PyTorch programs in various frontend languages, such as 
Python and C++. Regardless of the frontend programming language used, the execution of 
PyTorch programs occurs primarily in the backend of C++ implementations. Operations 
issued by the frontend language are passed on to the backend for execution. The backend 
manages its own threads that continuously collect and execute queued tasks. Note that for 
this to work the backend must be able to keep track of the dependencies between various 
steps in the computational graph. Hence, it is not possible to parallelize operations that
depend on each other.


Fig. 13.2.1: Programming language frontends and deep learning framework backends.


Let’s look at another toy example to understand the dependency graph a bit better. 


x = torch.ones((1, 2), device=device)
y = torch.ones((1, 2), device=device)
z = x * y + 2
z


tensor([[3., 3.]], device='cuda:0')


Fig. 13.2.2: The backend tracks dependencies between various steps in the computational graph.


The code snippet above is also illustrated in Fig. 13.2.2. Whenever the Python frontend 
thread executes one of the first three statements, it simply returns the task to the backend 
queue. When the last statement’s results need to be printed, the Python frontend thread 
will wait for the C++ backend thread to finish computing the result of the variable z. One 
benefit of this design is that the Python frontend thread does not need to perform actual 
computations. Thus, there is little impact on the program’s overall performance, regardless 
of Python’s performance. Fig. 13.2.3 illustrates how frontend and backend interact. 
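
The interplay between a frontend that only enqueues work and a backend that executes it can be imitated with the standard library. The sketch below is a simulation under stated assumptions, not how PyTorch is implemented: a single worker thread plays the backend, `time.sleep` stands in for an expensive kernel, and `task_queue.join()` plays the role of a synchronization point. All names are hypothetical.

```python
import queue
import threading
import time

task_queue = queue.Queue()
results = {}

def backend_worker():
    """Backend thread: repeatedly take a queued task and execute it."""
    while True:
        task = task_queue.get()
        if task is None:            # sentinel: shut the worker down
            task_queue.task_done()
            break
        name, fn = task
        results[name] = fn()
        task_queue.task_done()

worker = threading.Thread(target=backend_worker, daemon=True)
worker.start()

def slow_square(v):
    time.sleep(0.05)                # stand-in for an expensive kernel
    return v * v

start = time.perf_counter()
for i in range(4):
    # The frontend only enqueues work; it does not wait for the result
    task_queue.put((f'task{i}', lambda i=i: slow_square(i)))
enqueue_time = time.perf_counter() - start

task_queue.join()                   # block until the backend has caught up
total_time = time.perf_counter() - start

task_queue.put(None)                # stop the worker cleanly
worker.join()
print(f'enqueue: {enqueue_time:.4f} sec, after sync: {total_time:.4f} sec')
```

Enqueuing returns almost immediately, whereas the time measured after the synchronization point reflects the actual work, which mirrors the gap between the two torch timings in Section 13.2.1.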


13.2.2 Barriers and Blockers 


13.2.3 Improving Computation 
Fig. 13.2.3: Interactions of the frontend and backend.


13.2.4 Summary 


• Deep learning frameworks may decouple the Python frontend from an execution backend.
  This allows for fast asynchronous insertion of commands into the backend and
  associated parallelism.


• Asynchrony leads to a rather responsive frontend. However, use caution not to overfill
  the task queue since it may lead to excessive memory consumption. It is recommended
  to synchronize for each minibatch to keep frontend and backend approximately
  synchronized.


• Chip vendors offer sophisticated performance analysis tools to obtain a much more
  fine-grained insight into the efficiency of deep learning.


13.2.5 Exercises 


1. On the CPU, benchmark the same matrix multiplication operations in this section. Can 
you still observe asynchrony via the backend? 




13.3 Automatic Parallelism 


Deep learning frameworks (e.g., MXNet and PyTorch) automatically construct computa- 
tional graphs at the backend. Using a computational graph, the system is aware of all the 
dependencies, and can selectively execute multiple non-interdependent tasks in parallel to 
improve speed. For instance, Fig. 13.2.2 in Section 13.2 initializes two variables indepen- 
dently. Consequently the system can choose to execute them in parallel. 
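
How a scheduler can derive parallelism from a dependency graph may be sketched with the standard library alone. The example below reuses the x, y, z computation from Fig. 13.2.2 with plain Python lists instead of tensors. The `tasks` dictionary and the notion of a "wave" of ready tasks are assumptions made for this illustration; an actual framework backend is considerably more sophisticated.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Each task maps to (function, names of the tasks it depends on)
tasks = {
    'x': (lambda: [1.0, 1.0], ()),
    'y': (lambda: [1.0, 1.0], ()),
    'z': (lambda x, y: [a * b + 2 for a, b in zip(x, y)], ('x', 'y')),
}

sorter = TopologicalSorter({name: deps for name, (_, deps) in tasks.items()})
sorter.prepare()
values, waves = {}, []

with ThreadPoolExecutor(max_workers=2) as pool:
    while sorter.is_active():
        # All tasks whose inputs are available can be launched together
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        futures = {
            name: pool.submit(tasks[name][0],
                              *(values[d] for d in tasks[name][1]))
            for name in ready
        }
        for name, future in futures.items():
            values[name] = future.result()
            sorter.done(name)

print(waves)
print(values['z'])
```

The two initializations land in the same wave and may run concurrently, while z has to wait until both of its inputs exist.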


Typically, a single operator will use all the computational resources on all CPUs or on a sin- 
gle GPU. For example, the dot operator will use all cores (and threads) on all CPUs, even if 
there are multiple CPU processors on a single machine. The same applies to a single GPU. 
Hence parallelization is not quite so useful for single-device computers. With multiple
devices parallelization matters more. While parallelization is typically most relevant between
multiple GPUs, adding the local CPU will increase performance slightly. For example, see
Hadjis et al. (2016), which focuses on training computer vision models combining a GPU and a CPU.
With the convenience of an automatically parallelizing framework we can accomplish the
same goal in a few lines of Python code. More broadly, our discussion of automatic parallel
computation focuses on parallel computation using both CPUs and GPUs, as well as the
parallelization of computation and communication.


Note that we need at least two GPUs to run the experiments in this section. 


import torch
from d2l import torch as d2l


13.3.1 Parallel Computation on GPUs 


Let’s start by defining a reference workload to test: the run function below performs 50
matrix-matrix multiplications on the device of our choice using data allocated into two
variables: x_gpu1 and x_gpu2.


devices = d2l.try_all_gpus()
def run(x):
    return [x.mm(x) for _ in range(50)]

x_gpu1 = torch.rand(size=(4000, 4000), device=devices[0])
x_gpu2 = torch.rand(size=(4000, 4000), device=devices[1])


Now we apply the function to the data. To ensure that caching does not play a role in the
results we warm up the devices by performing a single pass on either of them prior to
measuring. torch.cuda.synchronize() waits for all kernels in all streams on a CUDA device
to complete. It takes in a device argument, the device for which we need to synchronize.
It uses the current device, given by current_device(), if the device argument is None
(default).


run(x_gpu1)
run(x_gpu2)  # Warm-up all devices
torch.cuda.synchronize(devices[0])
torch.cuda.synchronize(devices[1])

with d2l.Benchmark('GPU1 time'):
    run(x_gpu1)
    torch.cuda.synchronize(devices[0])

with d2l.Benchmark('GPU2 time'):
    run(x_gpu2)
    torch.cuda.synchronize(devices[1])


GPU1 time: 0.4660 sec
GPU2 time: 0.4510 sec 


If we remove the synchronize statement between both tasks the system is free to parallelize 
computation on both devices automatically. 
with d2l.Benchmark('GPU1 & GPU2'):
    run(x_gpu1)
    run(x_gpu2)
    torch.cuda.synchronize()


GPU1 & GPU2: 0.4659 sec


In the above case the total execution time is less than the sum of its parts, since the deep 
learning framework automatically schedules computation on both GPU devices without the 
need for sophisticated code on behalf of the user. 


13.3.2 Parallel Computation and Communication 


In many cases we need to move data between different devices, say between the CPU and 
GPU, or between different GPUs. For instance, this occurs when we want to perform dis- 
tributed optimization where we need to aggregate the gradients over multiple accelerator 
cards. Let’s simulate this by computing on the GPU and then copying the results back to 
the CPU. 


def copy_to_cpu(x, non_blocking=False):
    return [y.to('cpu', non_blocking=non_blocking) for y in x]

with d2l.Benchmark('Run on GPU1'):
    y = run(x_gpu1)
    torch.cuda.synchronize()

with d2l.Benchmark('Copy to CPU'):
    y_cpu = copy_to_cpu(y)
    torch.cuda.synchronize()


Run on GPU1: 0.4656 sec
Copy to CPU: 2.3125 sec


This is somewhat inefficient. Note that we could already start copying parts of y to the CPU 
while the remainder of the list is still being computed. This situation occurs, e.g., when we 
compute the (backprop) gradient on a minibatch. The gradients of some of the parameters 
will be available earlier than that of others. Hence it works to our advantage to start using 
PCI-Express bus bandwidth while the GPU is still running. In PyTorch, several functions 
such as to() and copy_() admit an explicit non_blocking argument, which lets the caller 
bypass synchronization when it is unnecessary. Setting non_blocking=True allows us to 
simulate this scenario. 


with d2l.Benchmark('Run on GPU1 and copy to CPU'):
    y = run(x_gpu1)
    y_cpu = copy_to_cpu(y, True)
    torch.cuda.synchronize()


Run on GPU1 and copy to CPU: 1.6907 sec


The total time required for both operations is (as expected) less than the sum of their parts. 
Note that this task is different from parallel computation as it uses a different resource: the 
bus between the CPU and GPUs. In fact, we could compute on both devices and communi- 
cate, all at the same time. As noted above, there is a dependency between computation and 
communication: y[i] must be computed before it can be copied to the CPU. Fortunately, 
the system can copy y[i-1] while computing y[i] to reduce the total running time. 
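
The benefit of overlapping can be estimated with a simple timing model. The sketch below assumes one compute unit and one bus, a fixed cost per result for computing and for transferring, and arbitrary illustrative units; the function names are hypothetical and the numbers are not measurements.

```python
def sequential_time(n, compute, transfer):
    """Compute all n results first, then copy them one after another."""
    return n * compute + n * transfer

def pipelined_time(n, compute, transfer):
    """Copy y[i-1] while y[i] is being computed (one compute unit, one bus)."""
    compute_done = 0.0    # time at which the latest result finished computing
    transfer_done = 0.0   # time at which the bus becomes free again
    for _ in range(n):
        compute_done += compute
        # A copy starts once its result exists and the bus is idle
        transfer_done = max(transfer_done, compute_done) + transfer
    return transfer_done

n, compute, transfer = 50, 1.0, 2.0   # arbitrary illustrative units
print(sequential_time(n, compute, transfer),
      pipelined_time(n, compute, transfer))
```

In this model the pipelined total is governed by the slower of the two resources, which is why overlapping helps but cannot push the time below that of the transfers alone when the bus is the bottleneck.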


We conclude with an illustration of the computational graph and its dependencies for a 
simple two-layer MLP when training on a CPU and two GPUs, as depicted in Fig. 13.3.1. It 
would be quite painful to schedule the parallel program resulting from this manually. This is 
where it is advantageous to have a graph-based computing backend for optimization. 


Fig. 13.3.1: The computational graph and its dependencies of a two-layer MLP on a CPU and two
GPUs.


13.3.3 Summary 


• Modern systems have a variety of devices, such as multiple GPUs and CPUs. They can
  be used in parallel, asynchronously.


• Modern systems also have a variety of resources for communication, such as PCI Express,
  storage (typically solid-state drives or via networks), and network bandwidth.
  They can be used in parallel for peak efficiency.


• The backend can improve performance through automatic parallel computation and
  communication.
13.3.4 Exercises


1. Fifty operations were performed in the run function defined in this section. There
are no dependencies between them. Design an experiment to see if the deep learning
framework will automatically execute them in parallel.


2. When the workload of an individual operator is sufficiently small, parallelization can
help even on a single CPU or GPU. Design an experiment to verify this.


3. Design an experiment that uses parallel computation on CPUs, GPUs, and
communication between both devices.


4. Use a debugger such as NVIDIA’s Nsight to verify that your code is efficient.


5. Design computation tasks that include more complex data dependencies, and run
experiments to see if you can obtain the correct results while improving performance.




13.4 Hardware 


Building systems with great performance requires a good understanding of the algorithms 
and models to capture the statistical aspects of the problem. At the same time it is also 
indispensable to have at least a modicum of knowledge of the underlying hardware. The 
current section is no substitute for a proper course on hardware and system design. Instead, 
it might serve as a starting point for understanding why some algorithms are more efficient 
than others and how to achieve good throughput. A good design can easily make a differ- 
ence of an order of magnitude and, in turn, this can make the difference between being able 
to train a network (e.g., in a week) and not at all (in 3 months, thus missing the deadline). 
We will start by looking at computers. Then we will zoom in to look more carefully at 
CPUs and GPUs. Lastly we zoom out to review how multiple computers are connected in 
a server center or in the cloud. 


Fig. 13.4.1: Latency numbers that every programmer should know.




Impatient readers may be able to get by with Fig. 13.4.1. It is taken from Colin Scott’s
interactive post that gives a good overview of the progress over the past decade. The original
numbers are due to Jeff Dean’s Stanford talk from 2010. The discussion below explains
some of the rationale for these numbers and how they can guide us in designing algorithms.
The discussion is very high level and cursory. It is clearly no substitute for a proper
course but rather just meant to provide enough information for a statistical modeler to make
suitable design decisions. For an in-depth overview of computer architecture we refer the
reader to (Hennessy and Patterson, 2011) or a recent course on the subject, such as the one
by Krste Asanović.


13.4.1 Computers 


Most deep learning researchers and practitioners have access to a computer with a fair 
amount of memory, computation, some form of an accelerator such as a GPU, or multiples 
thereof. A computer consists of the following key components: 


• A processor (also referred to as a CPU) that is able to execute the programs we give it (in
addition to running an operating system and many other things), typically consisting
of 8 or more cores.


• Memory (RAM) to store and retrieve the results from computation, such as weight vectors
and activations, and training data.


• An Ethernet network connection (sometimes multiple) with speeds ranging from 1 Gbit/s
to 100 Gbit/s. On high end servers more advanced interconnects can be found.


• A high speed expansion bus (PCIe) to connect the system to one or more GPUs. Servers
have up to 8 accelerators, often connected in an advanced topology, while desktop
systems have 1 or 2, depending on the budget of the user and the size of the power
supply.


• Durable storage, such as a magnetic hard disk drive or a solid state drive, in many cases
connected using the PCIe bus. It provides efficient transfer of training data to the
system and storage of intermediate checkpoints as needed.


Fig. 13.4.2: Connectivity of components of a computer.


As Fig. 13.4.2 indicates, most components (network, GPU, and storage) are connected to
the CPU across the PCIe bus. It consists of multiple lanes that are directly attached to the
CPU. For instance, AMD’s Threadripper 3 has 64 PCIe 4.0 lanes, each of which is capable of
16 Gbit/s data transfer in both directions. The memory is directly attached to the CPU with
a total bandwidth of up to 100 GB/s.
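The lane arithmetic above can be checked in a few lines. This is a minimal sketch: the helper name `pcie_bandwidth_gb_per_s` is ours, and it ignores encoding and protocol overhead, so measured throughput will be somewhat lower.

```python
def pcie_bandwidth_gb_per_s(lanes, gbit_per_lane=16.0):
    """Peak one-directional bandwidth in GB/s, ignoring protocol overhead."""
    return lanes * gbit_per_lane / 8.0   # 8 bits per byte


# 64 PCIe 4.0 lanes at 16 Gbit/s each, as quoted for the Threadripper 3.
print(pcie_bandwidth_gb_per_s(64))   # all lanes combined
print(pcie_bandwidth_gb_per_s(16))   # a 16-lane slot, e.g., for one GPU
print(pcie_bandwidth_gb_per_s(4))    # a 4-lane link, e.g., for one NVMe drive
```

Under these assumptions all lanes together can carry more than the 100 GB/s that main memory delivers, which is one reason why devices on the bus compete for memory bandwidth rather than for lanes alone.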


When we run code on a computer we need to shuffle data to the processors (CPUs or GPUs),
perform computation, and then move the results off the processor back to RAM and durable
storage. Hence, in order to get good performance we need to make sure that this works 
seamlessly without any one of the systems becoming a major bottleneck. For instance, if 
we cannot load images quickly enough the processor will not have any work to do. Likewise, 
if we cannot move matrices quickly enough to the CPU (or GPU), its processing elements 
will starve. Finally, if we want to synchronize multiple computers across the network, the 
latter should not slow down computation. One option is to interleave communication and 
computation. Let’s have a look at the various components in more detail. 


13.4.2 Memory 


At its most basic, memory is used to store data that needs to be readily accessible. At
present CPU RAM is typically of the DDR4 variety, offering 20-25 GB/s bandwidth
per module. Each module has a 64-bit-wide bus. Typically pairs of memory modules are
used to allow for multiple channels. CPUs have between 2 and 4 memory channels, i.e.,
they have between 40 GB/s and 100 GB/s peak memory bandwidth. Often there are two
banks per channel. For instance, AMD’s Zen 3 Threadripper has 8 slots.


While these numbers are impressive, indeed, they only tell part of the story. When we 
want to read a portion from memory we first need to tell the memory module where the 
information can be found. That is, we first need to send the address to RAM. Once this 
is accomplished we can choose to read just a single 64 bit record or a long sequence of 
records. The latter is called burst read. In a nutshell, sending an address to memory and 
setting up the transfer takes approximately 100 ns (details depend on the specific timing
coefficients of the memory chips used), whereas every subsequent transfer takes only 0.2 ns. In
short, the first read is 500 times as expensive as subsequent ones! Note that we could 
perform up to 10,000,000 random reads per second. This suggests that we avoid random 
memory access as far as possible and use burst reads (and writes) instead. 
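The cost model in this paragraph can be written down directly. The sketch below uses only the two timings quoted above (100 ns setup, 0.2 ns per subsequent 64-bit transfer); the function names are ours, and real DRAM timing is more intricate than this two-parameter model.

```python
SETUP_NS = 100.0   # send the address and set up the transfer (from the text)
BURST_NS = 0.2     # each subsequent 64-bit transfer in a burst (from the text)
WORD_BYTES = 8     # one 64-bit record


def read_time_ns(num_words, burst=True):
    """Time to read num_words records, either as one burst or as random reads."""
    if num_words == 0:
        return 0.0
    if burst:
        return SETUP_NS + (num_words - 1) * BURST_NS
    return num_words * SETUP_NS   # every random read pays the setup cost


def bandwidth_gb_per_s(num_words, burst=True):
    """Effective bandwidth; bytes per nanosecond equals GB/s."""
    return num_words * WORD_BYTES / read_time_ns(num_words, burst)


print(1e9 / SETUP_NS)                          # random reads per second
print(bandwidth_gb_per_s(1000, burst=False))   # effective GB/s, random access
print(bandwidth_gb_per_s(1000, burst=True))    # effective GB/s, one long burst
```

In this simplified model a burst of 1,000 records achieves several hundred times the effective bandwidth of 1,000 random reads.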


Matters are a bit more complex when we take into account that we have multiple banks.
Each bank can read memory largely independently. This means two things. On the one
hand, the effective number of random reads is up to 4 times higher, provided that they are
spread evenly across memory. It also means that it is still a bad idea to perform random
reads since burst reads are 4 times faster, too. On the other hand, due to memory alignment
to 64 bit boundaries it is a good idea to align any data structures with the same boundaries.
Compilers do this pretty much automatically when the appropriate flags are set. Curious
readers are encouraged to review a lecture on DRAMs such as the one by Zeshan Chishti.


GPU memory is subject to even higher bandwidth requirements since GPUs have many more
processing elements than CPUs. By and large there are two options to address them. The
first is to make the memory bus significantly wider. For instance, NVIDIA’s RTX 2080
Ti has a 352-bit-wide bus. This allows for much more information to be transferred at
the same time. Second, GPUs use specific high-performance memory. Consumer-grade
devices, such as NVIDIA’s RTX and Titan series, typically use GDDR6 chips with over
500 GB/s aggregate bandwidth. An alternative is to use HBM (high bandwidth memory)
modules. They use a very different interface and connect directly with GPUs on a dedicated
silicon wafer. This makes them very expensive and their use is typically limited to high-end
server chips, such as the NVIDIA Volta V100 series of accelerators. Quite unsurprisingly,
GPU memory is generally much smaller than CPU memory due to the higher cost of the
former. For our purposes, by and large their performance characteristics are similar, just a
lot faster. We can safely ignore the details for the purpose of this book. They only matter
when tuning GPU kernels for high throughput.
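Peak memory bandwidth is roughly the bus width times the per-pin data rate. The sketch below applies this to the bus widths quoted above. The per-pin rates (3.2 Gbit/s for a DDR4-3200 module, 14 Gbit/s for the GDDR6 on an RTX 2080 Ti) are assumptions that do not appear in the text; substitute the values for your own hardware.

```python
def peak_bandwidth_gb_per_s(bus_width_bits, gbit_per_pin):
    """Peak bandwidth in GB/s for a memory bus of the given width."""
    return bus_width_bits * gbit_per_pin / 8.0   # 8 bits per byte


# Assumed per-pin data rates; only the bus widths come from the text.
ddr4_module = peak_bandwidth_gb_per_s(bus_width_bits=64, gbit_per_pin=3.2)
gddr6_gpu = peak_bandwidth_gb_per_s(bus_width_bits=352, gbit_per_pin=14.0)

print(f"DDR4 module:  {ddr4_module:.1f} GB/s")
print(f"GDDR6 on GPU: {gddr6_gpu:.1f} GB/s")
```

The wider bus accounts for a factor of 5.5 and the faster signaling for the rest of the gap.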


13.4.3 Storage 


We saw that some of the key characteristics of RAM are bandwidth and latency. The same 
is true for storage devices, just that the differences can be even more extreme. 


Hard Disk Drives 


Hard disk drives (HDDs) have been in use for over half a century. In a nutshell they contain 
a number of spinning platters with heads that can be positioned to read or write at any given 
track. High-end disks hold up to 16 TB on 9 platters. One of the key benefits of HDDs 
is that they are relatively inexpensive. Among their many downsides are their typically
catastrophic failure modes and their relatively high read latency.


To understand the latter, consider the fact that HDDs spin at around 7,200 RPM (revolutions 
per minute). If they were much faster they would shatter due to the centrifugal force exerted 
on the platters. This has a major downside when it comes to accessing a specific sector 
on the disk: we need to wait until the platter has rotated in position (we can move the 
heads but not accelerate the actual disks). Hence it can take over 8 ms until the requested 
data is available. A common way this is expressed is to say that HDDs can operate at 
approximately 100 IOPs (input/output operations per second). This number has essentially 
remained unchanged for the past two decades. Worse still, it is equally difficult to increase 
bandwidth (it is in the order of 100-200 MB/s). After all, each head reads a track of bits, 
hence the bit rate only scales with the square root of the information density. As a result, 
HDDs are quickly becoming relegated to archival storage and low-grade storage for very 
large datasets. 
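The rotational delay follows directly from the spindle speed. The following sketch reproduces the figures above under the simplifying assumption, ours rather than the text's, that head movement is free and that the worst case is one full revolution.

```python
def revolution_ms(rpm):
    """Duration of one full platter revolution in milliseconds."""
    return 60_000.0 / rpm


def worst_case_iops(rpm):
    """Operations per second if every access waits one full revolution."""
    return 1000.0 / revolution_ms(rpm)


print(revolution_ms(7200))     # wait for a full turn, in ms
print(worst_case_iops(7200))   # same order of magnitude as the 100 IOPs quoted
```

Adding a realistic seek time would lower the operation count further, toward the roughly 100 IOPs cited above.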


Solid State Drives 


Solid state drives (SSDs) use flash memory to store information persistently. This allows 
for much faster access to stored records. Modern SSDs can operate at 100,000 to 500,000 
IOPs, i.e., up to 3 orders of magnitude faster than HDDs. Furthermore, their bandwidth 
can reach 1-3 GB/s, i.e., one order of magnitude faster than HDDs. These improvements
sound almost too good to be true. Indeed, they come with the following caveats, due to the 
way SSDs are designed. 


• SSDs store information in blocks (256 KB or larger). They can only be written as a whole,
which takes significant time. Consequently bit-wise random writes on SSD have very
poor performance. Likewise, writing data in general takes significant time since the
block has to be read, erased and then rewritten with new information. By now SSD
controllers and firmware have developed algorithms to mitigate this. Nonetheless,
writes can be much slower, in particular for QLC (quad level cell) SSDs. The key
for improved performance is to maintain a queue of operations, to prefer reads and to
write in large blocks if possible.


• The memory cells in SSDs wear out relatively quickly (often already after a few thousand
writes). Wear-level protection algorithms are able to spread the degradation over many
cells. That said, it is not recommended to use SSDs for swapping files or for large
aggregations of log-files.


• Lastly, the massive increase in bandwidth has forced computer designers to attach SSDs
directly to the PCIe bus. The drives capable of handling this, referred to as NVMe
(Non-Volatile Memory Express), can use up to 4 PCIe lanes. This amounts to up to
8 GB/s on PCIe 4.0.
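To see what the IOPs and bandwidth figures for HDDs and SSDs mean in practice, consider reading a dataset either record by record or as one sequential stream. This is a back-of-the-envelope sketch: the dataset (one million records of 4 KB each) is hypothetical, and the device numbers are the round figures quoted above.

```python
def random_read_seconds(num_records, iops):
    """Time to fetch records one by one, each costing one I/O operation."""
    return num_records / iops


def sequential_read_seconds(num_records, record_bytes, bytes_per_s):
    """Time to stream the same records as one contiguous file."""
    return num_records * record_bytes / bytes_per_s


N, SIZE = 1_000_000, 4096   # hypothetical dataset: one million 4 KB records

print("HDD random:    ", random_read_seconds(N, 100))              # 100 IOPs
print("SSD random:    ", random_read_seconds(N, 100_000))          # 100,000 IOPs
print("HDD sequential:", sequential_read_seconds(N, SIZE, 200e6))  # 200 MB/s
print("SSD sequential:", sequential_read_seconds(N, SIZE, 3e9))    # 3 GB/s
```

Under these assumptions an HDD needs hours for random access but well under a minute for a sequential scan, which is why training data is usually packed into large files.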


Cloud Storage 


Cloud storage provides a configurable range of performance. That is, the assignment of 
storage to virtual machines is dynamic, both in terms of quantity and in terms of speed, 
as chosen by users. We recommend that users increase the provisioned number of IOPs 
whenever latency is too high, e.g., during training with many small records. 


13.4.4 CPUs 


Central processing units (CPUs) are the centerpiece of any computer. They consist of a 
number of key components: processor cores that are able to execute machine code, a bus 
connecting them (the specific topology differs significantly between processor models, gen- 
erations, and vendors), and caches to allow for higher bandwidth and lower latency memory 
access than what is possible by reads from main memory. Lastly, almost all modern CPUs
contain vector processing units to aid with high performance linear algebra and convolutions,
as they are common in media processing and machine learning.


Fig. 13.4.3: Intel Skylake consumer quad-core CPU.


Fig. 13.4.3 depicts an Intel Skylake consumer-grade quad-core CPU. It has an integrated
GPU, caches, and a ringbus connecting the four cores. Peripherals, such as Ethernet, WiFi,
Bluetooth, SSD controller, and USB, are either part of the chipset or directly attached
(PCIe) to the CPU.


Microarchitecture 


Each of the processor cores consists of a rather sophisticated set of components. While
details differ between generations and vendors, the basic functionality is pretty much
standard. The front-end loads instructions and tries to predict which path will be taken (e.g.,
for control flow). Instructions are then decoded from assembly code to microinstructions.
Assembly code is often not the lowest level code that a processor executes. Instead,
complex instructions may be decoded into a set of lower-level operations. These are then
processed by the actual execution core. Often the latter is capable of performing many
operations simultaneously. For instance, the ARM Cortex A77 core of Fig. 13.4.4 is able to
perform up to 8 operations simultaneously.


Fig. 13.4.4: ARM Cortex A77 Microarchitecture.


This means that efficient programs might be able to perform more than one instruction per 
clock cycle, provided that they can be carried out independently. Not all units are created 
equal. Some specialize in integer instructions whereas others are optimized for floating 
point performance. To increase throughput, the processor might also follow multiple code 
paths simultaneously in a branching instruction and then discard the results of the branches 
not taken. This is why branch prediction units matter (on the front-end) such that only the 
most promising paths are pursued. 


Vectorization 


Deep learning is extremely compute-hungry. Hence, to make CPUs suitable for machine
learning, one needs to perform many operations in one clock cycle. This is achieved via
vector units. They have different names: on ARM they are called NEON, on x86 (in a
recent generation) they are referred to as AVX2 units. A common aspect is that they are able
to perform SIMD (single instruction multiple data) operations. Fig. 13.4.5 shows how 8
short integers can be added in one clock cycle on ARM.


Fig. 13.4.5: 128 bit NEON vectorization.


Depending on architecture choices, such registers are up to 512 bits long, allowing for the
combination of up to 64 pairs of numbers. For instance, we might be multiplying two
numbers and adding them to a third, which is also known as a fused multiply-add. Intel’s
OpenVino uses these to achieve respectable throughput for deep learning on server-grade
CPUs. Note, though, that this number is entirely dwarfed by what GPUs are capable
of achieving. For instance, NVIDIA’s RTX 2080 Ti has 4,352 CUDA cores, each of which
is capable of processing such an operation at any time.
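The lane counts mentioned for vector units follow from dividing the register width by the element width. The sketch below also emulates SIMD addition in plain Python to count how many (simulated) vector instructions a given array needs. Both function names are ours, and the emulation illustrates the bookkeeping only; it is not how a vector unit is programmed.

```python
def simd_lanes(register_bits, element_bits):
    """Number of elements that fit into one vector register."""
    return register_bits // element_bits


def simd_add(a, b, lanes):
    """Add two equal-length lists, `lanes` elements per simulated instruction."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out, instructions = [], 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1   # one vector instruction handles the whole chunk
    return out, instructions


print(simd_lanes(128, 16))   # 128-bit NEON register, 16-bit short integers
print(simd_lanes(512, 8))    # 512-bit register, 8-bit integers

values = list(range(16))
total, count = simd_add(values, values, lanes=simd_lanes(128, 16))
print(total, count)
```

With 8 lanes, adding 16 pairs of short integers takes 2 simulated instructions instead of 16 scalar additions.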


Cache 


Consider the following situation: we have a modest CPU with 4 cores as depicted in
Fig. 13.4.3 above, running at 2 GHz frequency. Moreover, let’s assume that we have an IPC
(instructions per clock) count of 1 and that the units have AVX2 with 256-bit width enabled.
Let’s furthermore assume that at least one of the registers used for AVX2 operations needs
to be retrieved from memory. This means that the CPU consumes $4 \times 256$ bit $= 128$ bytes
of data per clock cycle. Unless we are able to transfer $2 \times 10^9 \times 128 = 256 \times 10^9$ bytes
to the processor per second the processing elements are going to starve. Unfortunately the
memory interface of such a chip only supports 20-40 GB/s data transfer, i.e., one order of
magnitude less. The fix is to avoid loading new data from memory as far as possible and
rather to cache it locally on the CPU. This is where caches come in handy. Commonly the
following names or concepts are used:


• Registers are strictly speaking not part of the cache. They help stage instructions. That
said, CPU registers are memory locations that a CPU can access at clock speed without
any delay penalty. CPUs have tens of registers. It is up to the compiler (or programmer)
to use registers efficiently. For instance the C programming language has a
register keyword.


• L1 caches are the first line of defense against high memory bandwidth requirements.
L1 caches are tiny (typical sizes might be 32-64 KB) and often split into data and
instructions caches. When data is found in the L1 cache, access is very fast. If it
cannot be found there, the search progresses down the cache hierarchy.


• L2 caches are the next stop. Depending on architecture design and processor size they
might be exclusive. They might be accessible only by a given core or shared among
multiple cores. L2 caches are larger (typically 256-512 KB per core) and slower than
L1. Furthermore, to access something in L2 we first need to check to realize that the
data is not in L1, which adds a small amount of extra latency.


• L3 caches are shared among multiple cores and can be quite large. AMD’s Epyc 3 server
CPUs have a whopping 256 MB of cache spread across multiple chiplets. More typical
numbers are in the 4-8 MB range.
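The bandwidth estimate that motivates this cache hierarchy can be reproduced with a short calculation. The sketch below uses the numbers from the scenario above (4 cores, 256-bit vectors, 2 GHz, IPC of 1); the function name is ours, and the model assumes that every instruction loads exactly one vector from memory.

```python
def required_bytes_per_s(cores, vector_bits, clock_hz, ipc=1):
    """Bytes per second needed if every instruction loads one vector from memory."""
    bytes_per_cycle = cores * ipc * vector_bits // 8
    return bytes_per_cycle * clock_hz


demand = required_bytes_per_s(cores=4, vector_bits=256, clock_hz=2_000_000_000)
print(demand / 1e9, "GB/s needed")

for supply_gb in (20, 40):   # memory interface range quoted in the text
    ratio = demand / 1e9 / supply_gb
    print(f"a {supply_gb} GB/s interface falls short by a factor of {ratio:.1f}")
```

The shortfall is why data that is reused should stay in registers and caches rather than being fetched from main memory again.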


Predicting which memory elements will be needed next is one of the key optimization pa- 
rameters in chip design. For instance, it is advisable to traverse memory in a forward direc- 
tion since most caching algorithms will try to read ahead rather than backwards. Likewise, 
keeping memory access patterns local is a good way of improving performance. 
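The benefit of local access patterns can be illustrated with a toy cache model. The following sketch simulates a direct-mapped cache with 64-byte lines and 512 lines (32 KB, in the L1 range quoted above). These parameters and the function name are our assumptions, and real caches additionally use associativity and prefetching, which this model omits.

```python
def hit_rate(addresses, line_bytes=64, num_lines=512):
    """Fraction of accesses served from a simulated direct-mapped cache."""
    cache = [None] * num_lines        # which memory line each slot holds
    hits = 0
    for addr in addresses:
        line = addr // line_bytes     # memory line containing this byte
        slot = line % num_lines       # direct-mapped placement
        if cache[slot] == line:
            hits += 1
        else:
            cache[slot] = line        # miss: load the line, evicting the old one
    return hits / len(addresses)


n = 100_000
sequential = [8 * i for i in range(n)]   # consecutive 64-bit words
strided = [64 * i for i in range(n)]     # one word per cache line

print(hit_rate(sequential))   # 7 of every 8 accesses reuse a loaded line
print(hit_rate(strided))      # every access touches a new line
```

In this model sequential access hits the cache for 7 out of 8 words, while a stride of one cache line never hits at all.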


Adding caches is a double-edged sword. On the one hand they ensure that the processor
cores do not starve of data. At the same time they increase chip size, using up area that
otherwise could have been spent on increasing processing power. Moreover, cache misses
can be expensive. Consider the worst case scenario, false sharing, as depicted in Fig. 13.4.6.
A memory location is cached on processor 0 when a thread on processor 1 requests the
data. To obtain it, processor 0 needs to stop what it is doing, write the information back
to main memory and then let processor 1 read it from memory. During this operation both
processors wait. Quite potentially such code runs more slowly on multiple processors when
compared with an efficient single-processor implementation. This is one more reason
why there is a practical limit to cache sizes (besides their physical size).


Fig. 13.4.6: False sharing (image courtesy of Intel).


13.4.5 GPUs and other Accelerators 


It is not an exaggeration to claim that deep learning would not have been successful without 
GPUs. By the same token, it is quite reasonable to argue that GPU manufacturers’ fortunes 
have increased significantly due to deep learning. This co-evolution of hardware and al- 
gorithms has led to a situation where for better or worse deep learning is the preferable 
statistical modeling paradigm. Hence it pays to understand the specific benefits that GPUs
and related accelerators such as the TPU (Jouppi et al., 2017) offer.


Of note is a distinction that is often made in practice: accelerators are optimized either for
training or inference. For the latter we only need to compute the forward propagation in
a network. No storage of intermediate data is needed for backpropagation. Moreover, we
may not need very precise computation (FP16 or INT8 typically suffice). On the other hand,
during training all intermediate results need storage to compute gradients. Moreover,
accumulating gradients requires higher precision to avoid numerical underflow (or overflow).
This means that FP16 (or mixed precision with FP32) is the minimum requirement. All
of this necessitates faster and larger memory (HBM2 vs. GDDR6) and more processing
power. For instance, NVIDIA’s Turing T4 GPUs are optimized for inference whereas
the V100 GPUs are preferable for training.
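The precision concerns for training can be observed without any accelerator. The sketch below uses Python's standard `struct` module, whose `'e'` format is IEEE 754 half precision, to round values to FP16. The helper `to_fp16` and the sample values are ours and chosen purely for illustration.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


# A very small gradient underflows to zero when stored in FP16.
tiny_gradient = 1e-8
print(to_fp16(tiny_gradient))

# A small update added to a weight near 1 is rounded away, since FP16
# has only 10 mantissa bits (spacing of about 0.001 near 1.0).
weight, update = 1.0, 1e-4
print(to_fp16(weight + update))
```

Both effects are reasons why mixed precision training typically keeps an FP32 copy of the weights and accumulates in FP32.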




Recall vectorization as illustrated in Fig. 13.4.5. Adding vector units to a processor core 
allowed us to increase throughput significantly. For instance, in Fig. 13.4.5
we were able to perform 8 operations simultaneously. First, what if we added operations
that optimized not just operations between vectors but also between matrices? This strategy 
led to tensor cores (to be covered shortly). Second, what if we added many more cores? 
In a nutshell, these two strategies summarize the design decisions in GPUs. Fig. 13.4.7 
gives an overview of a basic processing block. It contains 16 integer and 16 floating point 
units. In addition to that, two tensor cores accelerate a narrow subset of additional op- 
erations relevant for deep learning. Each streaming multiprocessor consists of four such 
blocks. 


Fig. 13.4.7: NVIDIA Turing processing block (image courtesy of NVIDIA).


Next, 12 streaming multiprocessors are grouped into graphics processing clusters which 
make up the high-end TU102 processors. Ample memory channels and an L2 cache com- 
plement the setup. Fig. 13.4.8 has the relevant details. One of the reasons for designing 
such a device is that individual blocks can be added or removed as needed to allow for more 
compact chips and to deal with yield issues (faulty modules might not be activated). Fortu- 
nately programming such devices is well hidden from the casual deep learning researcher 
beneath layers of CUDA and framework code. In particular, more than one of the programs 
might well be executed simultaneously on the GPU, provided that there are available re- 
sources. Nonetheless it pays to be aware of the limitations of the devices to avoid picking 
models that do not fit into device memory. 
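The unit counts given for Turing can be combined into a rough picture of the chip. The sketch below assumes that a "CUDA core" corresponds to one floating point unit of a processing block; this is our reading rather than a statement from the text, and the constant and function names are hypothetical.

```python
FP32_UNITS_PER_BLOCK = 16   # floating point units per processing block (from the text)
BLOCKS_PER_SM = 4           # processing blocks per streaming multiprocessor
SMS_PER_CLUSTER = 12        # streaming multiprocessors per graphics processing cluster


def sm_count(cuda_cores):
    """Number of streaming multiprocessors implied by a CUDA core count."""
    return cuda_cores // (FP32_UNITS_PER_BLOCK * BLOCKS_PER_SM)


sms = sm_count(4352)                   # RTX 2080 Ti core count quoted earlier
print(sms)                             # active streaming multiprocessors
print(divmod(sms, SMS_PER_CLUSTER))    # full clusters, remaining multiprocessors
```

The count is not a multiple of 12, which would be consistent with some multiprocessors being left inactive to deal with yield issues, as described above.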


Fig. 13.4.8: NVIDIA Turing architecture (image courtesy of NVIDIA).


A last aspect that is worth mentioning in more detail is tensor cores. They are an example
of a recent trend of adding more optimized circuits that are specifically effective for deep
learning. For instance, the TPU added a systolic array (Kung, 1988) for fast matrix
multiplication. There the design was to support a very small number (one for the first generation of
TPUs) of large operations. Tensor cores are at the other end. They are optimized for small
operations involving between $4 \times 4$ and $16 \times 16$ matrices, depending on their numerical
precision. Fig. 13.4.9 gives an overview of the optimizations.


Fig. 13.4.9: NVIDIA tensor cores in Turing (image courtesy of NVIDIA).


Obviously when optimizing for computation we end up making certain compromises. One
of them is that GPUs are not very good at handling interrupts and sparse data. While there
are notable exceptions, such as Gunrock (Wang et al., 2016), the access pattern of sparse
matrices and vectors does not go well with the high bandwidth burst read operations where
GPUs excel. Matching both goals is an area of active research. See, e.g., DGL, a library
tuned for deep learning on graphs.
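To make the small matrix operations handled by tensor cores more concrete, the following sketch spells out a fused matrix multiply-accumulate, D = AB + C, on 4 x 4 matrices in plain Python. It illustrates the arithmetic only. The function name is ours, and a real tensor core performs these multiply-adds in hardware with reduced-precision inputs.

```python
def matmul_accumulate(a, b, c):
    """Return a @ b + c for small square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) + c[i][j]
             for j in range(n)]
            for i in range(n)]


n = 4
identity = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
b = [[n * i + j for j in range(n)] for i in range(n)]
c = [[1] * n for _ in range(n)]

d = matmul_accumulate(identity, b, c)   # equals b with 1 added to every entry
for row in d:
    print(row)
print(n ** 3, "multiply-adds in one such operation")
```

Larger matrix products are tiled into many such small blocks, which is why keeping layer dimensions aligned to the tile size helps.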


13.4.6 Networks and Buses 


Whenever a single device is insufficient for optimization we need to transfer data to and 
from it to synchronize processing. This is where networks and buses come in handy. We 
have a number of design parameters: bandwidth, cost, distance, and flexibility. On one end 
we have WiFi, which has a pretty good range, is very easy to use (no wires, after all), and is
cheap, but it offers comparatively mediocre bandwidth and latency. No machine learning researcher
in their right mind would use it to build a cluster of servers. In what follows we focus
on interconnects that are suitable for deep learning. 


• PCIe is a dedicated bus for very high bandwidth point-to-point connections (up to 32
GB/s on PCIe 4.0 in a 16-lane slot). Latency is in the order of single-digit
microseconds (5 µs). PCIe links are precious. Processors only have a limited number
of them: AMD’s EPYC 3 has 128 lanes, Intel’s Xeon has up to 48 lanes per chip;
on desktop-grade CPUs the numbers are 20 (Ryzen 9) and 16 (Core i9) respectively.
Since GPUs have typically 16 lanes, this limits the number of GPUs that can connect
to the CPU at full bandwidth. After all, they need to share the links with other high
bandwidth peripherals such as storage and Ethernet. Just like with RAM access, large
bulk transfers are preferable due to reduced packet overhead.


• Ethernet is the most commonly used way of connecting computers. While it is significantly
slower than PCIe, it is very cheap and resilient to install and covers much longer
distances. Typical bandwidth for low-grade servers is 1 GBit/s. Higher-end devices
(e.g., C5 instances in the cloud) offer between 10 and 100 GBit/s bandwidth. As
in all previous cases data transmission has significant overheads. Note that we almost
never use raw Ethernet directly but rather a protocol that is executed on top of
the physical interconnect (such as UDP or TCP/IP). This adds further overhead. Like
PCIe, Ethernet is designed to connect two devices, e.g., a computer and a switch.


• Switches allow us to connect multiple devices in a manner where any pair of them can
carry out a (typically full bandwidth) point-to-point connection simultaneously. For
instance, Ethernet switches might connect 40 servers at high cross-sectional bandwidth.
Note that switches are not unique to traditional computer networks. Even PCIe
lanes can be switched. This occurs, e.g., to connect a large number of GPUs to a
host processor, as is the case for the P2 instances.


• NVLink is an alternative to PCIe when it comes to very high bandwidth interconnects.
It offers up to 300 Gbit/s data transfer rate per link. Server GPUs (Volta V100) have
six links whereas consumer-grade GPUs (RTX 2080 Ti) have only one link, operating
at a reduced 100 Gbit/s rate. We recommend using NCCL to achieve high data
transfer between GPUs.
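The preference for bulk transfers can be quantified with a simple latency-plus-bandwidth model. The sketch below uses the PCIe figures from the list above (5 µs latency, 32 GB/s). The function name is ours, and the model assumes that each transfer pays the full latency once and that transfers do not overlap.

```python
def transfer_seconds(total_bytes, num_transfers, latency_s, bytes_per_s):
    """Time to move total_bytes split evenly into num_transfers messages."""
    return num_transfers * latency_s + total_bytes / bytes_per_s


LATENCY = 5e-6      # 5 microseconds per transfer (PCIe figure from the text)
BANDWIDTH = 32e9    # 32 GB/s (PCIe 4.0 in a 16-lane slot, from the text)
MB = 1_000_000

bulk = transfer_seconds(MB, 1, LATENCY, BANDWIDTH)
chunked = transfer_seconds(MB, 1000, LATENCY, BANDWIDTH)

print(f"one 1 MB transfer:   {bulk * 1e6:.1f} us")
print(f"1000 1 KB transfers: {chunked * 1e6:.1f} us")
print(f"slowdown:            {chunked / bulk:.0f}x")
```

In this model splitting 1 MB into a thousand small messages is more than a hundred times slower, because latency rather than bandwidth dominates.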


13.4.7 More Latency Numbers


The summaries in Table 13.4.1 and Table 13.4.2 are from Eliot Eshelman, who maintains
an updated version of the numbers as a GitHub gist.


Table 13.4.1: Common Latency Numbers.


Action | Time | Notes
L1 cache reference/hit | 1.5 ns | 4 cycles
Floating-point add/mult/FMA | 1.5 ns | 4 cycles
L2 cache reference/hit | 5 ns | 12 ~ 17 cycles
Branch mispredict | 6 ns | 15 ~ 20 cycles
L3 cache hit (unshared cache) | 16 ns | 42 cycles
L3 cache hit (shared in another core) | 25 ns | 65 cycles
Mutex lock/unlock | 25 ns |
L3 cache hit (modified in another core) | 29 ns | 75 cycles
L3 cache hit (on a remote CPU socket) | 40 ns | 100 ~ 300 cycles (40 ~ 116 ns)
QPI hop to another CPU (per hop) | 40 ns |
64MB memory ref. (local CPU) | 46 ns | TinyMemBench on Broadwell E5-2690v4
64MB memory ref. (remote CPU) | 70 ns | TinyMemBench on Broadwell E5-2690v4
256MB memory ref. (local CPU) | 75 ns | TinyMemBench on Broadwell E5-2690v4
Intel Optane random write | 94 ns | UCSD Non-Volatile Systems Lab
256MB memory ref. (remote CPU) | 120 ns | TinyMemBench on Broadwell E5-2690v4
Intel Optane random read | 305 ns | UCSD Non-Volatile Systems Lab
Send 4KB over 100 Gbps HPC fabric | 1 µs | MVAPICH2 over Intel Omni-Path
Compress 1KB with Google Snappy | 3 µs |
Send 4KB over 10 Gbps ethernet | 10 µs |
Write 4KB randomly to NVMe SSD | 30 µs | DC P3608 NVMe SSD (QOS 99% is 500 µs)
Transfer 1MB to/from NVLink GPU | 30 µs | ~33GB/s on NVIDIA 40GB NVLink
Transfer 1MB to/from PCI-E GPU | 80 µs | ~12GB/s on PCIe 3.0 x16 link
Read 4KB randomly from NVMe SSD | 120 µs | DC P3608 NVMe SSD (QOS 99%)
Read 1MB sequentially from NVMe SSD | 208 µs | ~4.8GB/s DC P3608 NVMe SSD
Write 4KB randomly to SATA SSD | 500 µs | DC S3510 SATA SSD (QOS 99.9%)
Read 4KB randomly from SATA SSD | 500 µs | DC S3510 SATA SSD (QOS 99.9%)
Round trip within same data center | 500 µs | One-way ping is ~250 µs
Read 1MB sequentially from SATA SSD | 2 ms | ~550MB/s DC S3510 SATA SSD
Read 1MB sequentially from disk | 5 ms | ~200MB/s server HDD
Random Disk Access (seek+rotation) | 10 ms |
Send packet CA->Netherlands->CA | 150 ms |


Table 13.4.2: Latency Numbers for NVIDIA Tesla GPUs.


Action | Time | Notes
GPU Shared Memory access | 30 ns | 30~90 cycles (bank conflicts add latency)
GPU Global Memory access | 200 ns | 200~800 cycles
Launch CUDA kernel on GPU | 10 µs | Host CPU instructs GPU to start kernel
Transfer 1MB to/from NVLink GPU | 30 µs | ~33GB/s on NVIDIA 40GB NVLink
Transfer 1MB to/from PCI-E GPU | 80 µs | ~12GB/s on PCI-Express x16 link

13.4.8 Summary 


• Devices have overheads for operations. Hence it is important to aim for a small number
of large transfers rather than many small ones. This applies to RAM, SSDs, networks
and GPUs.


• Vectorization is key for performance. Make sure you are aware of the specific abilities
of your accelerator. E.g., some Intel Xeon CPUs are particularly good for INT8
operations, NVIDIA Volta GPUs excel at FP16 matrix-matrix operations and NVIDIA
Turing shines at FP16, INT8, and INT4 operations.


• Numerical overflow due to small data types can be a problem during training (and to a
lesser extent during inference).


• Aliasing can significantly degrade performance. For instance, memory alignment on 64
bit CPUs should be done with respect to 64 bit boundaries. On GPUs it is a good idea
to keep convolution sizes aligned, e.g., to tensor cores.


• Match your algorithms to the hardware (e.g., memory footprint, and bandwidth). Great
speedup (orders of magnitude) can be achieved when fitting the parameters into caches.


• We recommend that you sketch out the performance of a novel algorithm on paper before
verifying the experimental results. Discrepancies of an order-of-magnitude or more
are reasons for concern.


• Use profilers to debug performance bottlenecks.


• Training and inference hardware have different sweet spots in terms of price and
performance.


13.4.9 Exercises 


1. Write C code to test whether there is any difference in speed between accessing memory
aligned or misaligned relative to the external memory interface. Hint: be careful of
caching effects.


2. Test the difference in speed between accessing memory in sequence or with a given
stride.


3. How could you measure the cache sizes on a CPU?


4. How would you lay out data across multiple memory channels for maximum bandwidth?
How would you lay it out if you had many small threads?


5. An enterprise-class HDD is spinning at 10,000 rpm. What is the absolutely minimum
time an HDD needs to spend worst case before it can read data (you can assume that
heads move almost instantaneously)? Why are 2.5” HDDs becoming popular for
commercial servers (relative to 3.5” and 5.25” drives)?


6. Assume that an HDD manufacturer increases the storage density from 1 Tbit per square
inch to 5 Tbit per square inch. How much information can you store on a ring on a 2.5”
HDD? Is there a difference between the inner and outer tracks?


7. Going from 8 bit to 16 bit data types increases the amount of silicon approximately by
four times. Why? Why might NVIDIA have added INT4 operations to their Turing
GPUs?


8. How much faster is it to read forward through memory vs. reading backwards? Does
this number differ between different computers and CPU vendors? Why? Write C code
and experiment with it.


9. Can you measure the cache size of your disk? What is it for a typical HDD? Do SSDs
need a cache?


10. Measure the packet overhead when sending messages across the Ethernet. Look up the
difference between UDP and TCP/IP connections.


11. Direct memory access allows devices other than the CPU to write (and read) directly to
(from) memory. Why is this a good idea?


12. Look at the performance numbers for the Turing T4 GPU. Why does the performance
“only” double as you go from FP16 to INT8 and INT4?


13. What is the shortest time it should take for a packet on a round trip between San
Francisco and Amsterdam? Hint: you can assume that the distance is 10,000 km.




13.5 Training on Multiple GPUs 


So far we discussed how to train models efficiently on CPUs and GPUs. We even showed
how deep learning frameworks allow one to parallelize computation and communication
automatically between them in Section 13.3. We also showed in Section 6.7 how to list
all the available GPUs on a computer using the nvidia-smi command. What we did not
discuss is how to actually parallelize deep learning training. Instead, we implied in passing
that one would somehow split the data across multiple devices and make it work. The
present section fills in the details and shows how to train a network in parallel when starting
from scratch. Details on how to take advantage of functionality in high-level APIs are
relegated to Section 13.6. We assume that you are familiar with minibatch stochastic gradient
descent algorithms such as the ones described in Section 12.5.


13.5.1 Splitting the Problem 


Let’s start with a simple computer vision problem and a slightly archaic network, e.g., with
multiple layers of convolutions, pooling, and possibly a few fully connected layers at the
end. That is, let’s start with a network that looks quite similar to LeNet (LeCun et al.,
1998) or AlexNet (Krizhevsky et al., 2012). Given multiple GPUs (2 if it is a desktop
server, 4 on an AWS g4dn.12xlarge instance, 8 on a p3.16xlarge, or 16 on a p2.16xlarge),
we want to partition training in such a way as to achieve good speedup while simultaneously
benefiting from simple and reproducible design choices. Multiple GPUs, after all, increase
both memory and computation ability. In a nutshell, we have the following choices, given
a minibatch of training data that we want to classify.


First, we could partition the network across multiple GPUs. That is, each GPU takes as 
input the data flowing into a particular layer, processes data across a number of subsequent 
layers and then sends the data to the next GPU. This allows us to process data with larger 
networks when compared with what a single GPU could handle. Besides, memory footprint 
per GPU can be well controlled (it is a fraction of the total network footprint). 


However, the interface between layers (and thus GPUs) requires tight synchronization. This 
can be tricky, in particular if the computational workloads are not properly matched be- 
tween layers. The problem is exacerbated for large numbers of GPUs. The interface be- 
tween layers also requires large amounts of data transfer, such as activations and gradients. 
This may overwhelm the bandwidth of the GPU buses. Moreover, compute-intensive, yet 
sequential operations are nontrivial to partition. See e.g., Mirhoseini et al. (2017) for a best 
effort in this regard. It remains a difficult problem and it is unclear whether it is possible 
to achieve good (linear) scaling on nontrivial problems. We do not recommend it unless 
there is excellent framework or operating system support for chaining together multiple 
GPUs. 


Second, we could split the work layerwise. For instance, rather than computing 64 channels 
on a single GPU we could split up the problem across 4 GPUs, each of which generates data 
for 16 channels. Likewise, for a fully connected layer we could split the number of output 
units. Fig. 13.5.1 (taken from Krizhevsky et al. (2012)) illustrates this design, where this 
strategy was used to deal with GPUs that had a very small memory footprint (2 GB at the 
time). This allows for good scaling in terms of computation, provided that the number of 
channels (or units) is not too small. Besides, multiple GPUs can process increasingly larger 
networks since the available memory scales linearly. 
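

To make the layerwise split concrete, the following minimal sketch partitions the output
units of a small fully connected layer across four simulated devices. Plain Python lists
stand in for GPU tensors, and the names dense and split_dense are illustrative only; they
are not part of any framework. Each simulated device holds just a slice of the weights, yet
it needs the complete input, and the partial outputs have to be gathered afterwards.

```python
def dense(W, b, x):
    """Compute W x + b for a weight matrix W (one row per output unit)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias
            for row, bias in zip(W, b)]

def split_dense(W, b, x, num_devices):
    """Simulate splitting the output units of a dense layer across devices."""
    per_device = len(W) // num_devices  # assumption: the units divide evenly
    outputs = []
    for d in range(num_devices):
        lo, hi = d * per_device, (d + 1) * per_device
        # Each device stores only its slice of W and b, but needs all of x
        outputs.append(dense(W[lo:hi], b[lo:hi], x))
    # Gathering the partial outputs is where devices must communicate
    return [y for part in outputs for y in part]

# A layer with 8 output units and 3 inputs, split across 4 devices
W = [[float(i + j) for j in range(3)] for i in range(8)]
b = [0.5 * i for i in range(8)]
x = [1.0, -2.0, 3.0]
print(split_dense(W, b, x, 4) == dense(W, b, x))  # True
```

The result matches the unsplit layer, but every layer now ends with a gather step, which is
one way to see where the synchronization cost of this strategy comes from.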


However, we need a very large number of synchronization or barrier operations since each 
layer depends on the results from all the other layers. Moreover, the amount of data that 
needs to be transferred is potentially even larger than when distributing layers across GPUs. 
Thus, we do not recommend this approach due to its bandwidth cost and complexity. 


Fig. 13.5.1: Model parallelism in the original AlexNet design due to limited GPU memory.


Last, we could partition data across multiple GPUs. This way all GPUs perform the same
type of work, albeit on different observations. Gradients are aggregated across GPUs after
each minibatch of training data. This is the simplest approach and it can be applied in any
situation. We only need to synchronize after each minibatch. That said, it is highly desirable
to start exchanging the gradients of some parameters while others are still being computed.
Moreover, larger numbers of GPUs lead to larger minibatch sizes, thus increasing training
efficiency. However, adding more GPUs does not allow us to train larger models.


Fig. 13.5.2: Parallelization on multiple GPUs. From left to right: original problem, network
partitioning, layerwise partitioning, data parallelism.


A comparison of different ways of parallelization on multiple GPUs is depicted in Fig. 
13.5.2. By and large, data parallelism is the most convenient way to proceed, provided 
that we have access to GPUs with sufficiently large memory. See also (Li et al., 2014) for 
a detailed description of partitioning for distributed training. GPU memory used to be a 
problem in the early days of deep learning. By now this issue has been resolved for all but 
the most unusual cases. We focus on data parallelism in what follows. 


13.5.2 Data Parallelism 


Assume that there are k GPUs on a machine. Given the model to be trained, each GPU will
maintain a complete set of model parameters independently, though parameter values across
the GPUs are identical and synchronized. As an example, Fig. 13.5.3 illustrates training
with data parallelism when k = 2.


Fig. 13.5.3: Calculation of minibatch stochastic gradient descent using data parallelism on two GPUs.


In general, the training proceeds as follows: 


• In any iteration of training, given a random minibatch, we split the examples in the batch
into k portions and distribute them evenly across the GPUs.


• Each GPU calculates loss and gradient of the model parameters based on the minibatch
subset it was assigned.


• The local gradients of each of the k GPUs are aggregated to obtain the current minibatch
stochastic gradient.


• The aggregate gradient is re-distributed to each GPU.


• Each GPU uses this minibatch stochastic gradient to update the complete set of model
parameters that it maintains.


Note that in practice we increase the minibatch size k-fold when training on k GPUs such 
that each GPU has the same amount of work to do as if we were training on a single GPU 
only. On a 16-GPU server this can increase the minibatch size considerably and we may 
have to increase the learning rate accordingly. Also note that batch normalization in Section 
8.5 needs to be adjusted, e.g., by keeping a separate batch normalization coefficient per 
GPU. In what follows we will use a toy network to illustrate multi-GPU training. 
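

Before turning to the toy network, the procedure above can be checked on a tiny example.
The sketch below is only an illustration under stated assumptions: it uses a linear model
with squared error, plain Python lists instead of GPU tensors, and the hypothetical helper
names grad_sum and data_parallel_step. It shows that splitting a minibatch across k
simulated devices and summing the local gradients yields the same update as processing the
whole minibatch on one device.

```python
def grad_sum(w, X, y):
    """Sum over examples of the gradient of the squared error (w.x - y)^2."""
    g = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
        for j, xj in enumerate(xi):
            g[j] += 2 * err * xj
    return g

def data_parallel_step(w, X, y, k, lr):
    """One minibatch SGD step with the batch split across k simulated devices."""
    n = len(X)
    shard = n // k  # assumption: the batch size is divisible by k
    # Every device computes the gradient on the shard it was assigned
    local = [grad_sum(w, X[i * shard:(i + 1) * shard],
                      y[i * shard:(i + 1) * shard]) for i in range(k)]
    # The local gradients are aggregated; all devices receive the same result
    total = [sum(g[j] for g in local) for j in range(len(w))]
    # Every device applies the same update, normalized by the full batch size
    return [wj - lr * gj / n for wj, gj in zip(w, total)]

X = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]]
y = [1.0, 0.0, 2.0, -1.0]
w = [0.1, -0.2]
single = data_parallel_step(w, X, y, k=1, lr=0.1)
double = data_parallel_step(w, X, y, k=2, lr=0.1)
# Equal up to floating-point rounding, since only the summation order differs
print(all(abs(a - b) < 1e-12 for a, b in zip(single, double)))  # True
```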


%matplotlib inline
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


13.5.3 A Toy Network 


We use LeNet as introduced in Section 7.6 (with slight modifications). We define it from 
scratch to illustrate parameter exchange and synchronization in detail. 


# Initialize model parameters
scale = 0.01
W1 = torch.randn(size=(20, 1, 3, 3)) * scale
b1 = torch.zeros(20)
W2 = torch.randn(size=(50, 20, 5, 5)) * scale
b2 = torch.zeros(50)
W3 = torch.randn(size=(800, 128)) * scale
b3 = torch.zeros(128)
W4 = torch.randn(size=(128, 10)) * scale
b4 = torch.zeros(10)
params = [W1, b1, W2, b2, W3, b3, W4, b4]

# Define the model
def lenet(X, params):
    h1_conv = F.conv2d(input=X, weight=params[0], bias=params[1])
    h1_activation = F.relu(h1_conv)
    h1 = F.avg_pool2d(input=h1_activation, kernel_size=(2, 2), stride=(2, 2))
    h2_conv = F.conv2d(input=h1, weight=params[2], bias=params[3])
    h2_activation = F.relu(h2_conv)
    h2 = F.avg_pool2d(input=h2_activation, kernel_size=(2, 2), stride=(2, 2))
    h2 = h2.reshape(h2.shape[0], -1)
    h3_linear = torch.mm(h2, params[4]) + params[5]
    h3 = F.relu(h3_linear)
    y_hat = torch.mm(h3, params[6]) + params[7]
    return y_hat

# Cross-entropy loss function
loss = nn.CrossEntropyLoss(reduction='none')


13.5.4 Data Synchronization 


For efficient multi-GPU training we need two basic operations. First we need to have 
the ability to distribute a list of parameters to multiple devices and to attach gradients 
(get_params). Without parameters it is impossible to evaluate the network on a GPU. 
Second, we need the ability to sum parameters across multiple devices, i.e., we need an 
allreduce function. 


def get_params(params, device):
    new_params = [p.to(device) for p in params]
    for p in new_params:
        p.requires_grad_()
    return new_params


Let’s try it out by copying the model parameters to one GPU.


new_params = get_params(params, d2l.try_gpu(0))
print('b1 weight:', new_params[1])
print('b1 grad:', new_params[1].grad)


b1 weight: tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       device='cuda:0', requires_grad=True)
b1 grad: None


Since we did not perform any computation yet, the gradient with regard to the bias parameter
has not been populated and is reported as None. Now let’s assume that we have a vector
distributed across multiple GPUs. The following allreduce function adds up all vectors and
broadcasts the result back to all GPUs. Note that for this to work we need to copy the data
to the device accumulating the results.


def allreduce(data):
    for i in range(1, len(data)):
        data[0][:] += data[i].to(data[0].device)
    for i in range(1, len(data)):
        data[i][:] = data[0].to(data[i].device)


Let’s test this by creating vectors with different values on different devices and
aggregating them.


data = [torch.ones((1, 2), device=d2l.try_gpu(i)) * (i + 1) for i in range(2)]
print('before allreduce:\n', data[0], '\n', data[1])
allreduce(data)
print('after allreduce:\n', data[0], '\n', data[1])


before allreduce:
 tensor([[1., 1.]], device='cuda:0')
 tensor([[2., 2.]], device='cuda:1')
after allreduce:
 tensor([[3., 3.]], device='cuda:0')
 tensor([[3., 3.]], device='cuda:1')


13.5.5 Distributing Data 


We need a simple utility function to distribute a minibatch evenly across multiple GPUs.
For instance, on two GPUs we would like half of the data to be copied to each of
the GPUs. Since it is more convenient and more concise, we use the built-in function from
the deep learning framework to try it out on a 4 x 5 matrix.


data = torch.arange(20).reshape(4, 5)
devices = [torch.device('cuda:0'), torch.device('cuda:1')]
split = nn.parallel.scatter(data, devices)
print('input :', data)
print('load into', devices)
print('output:', split)


input : tensor([[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]])
load into [device(type='cuda', index=0), device(type='cuda', index=1)]
output: (tensor([[0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9]], device='cuda:0'), tensor([[10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]], device='cuda:1'))
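

As a rough illustration of what an even split along the first dimension looks like, the
sketch below reproduces the result above with plain Python lists. The helper split_evenly
is hypothetical and only mimics the output shown; it makes no claim about how the
framework's scatter function is implemented or how it treats batch sizes that do not divide
evenly.

```python
def split_evenly(rows, num_devices):
    """Split a list of rows into contiguous, equally sized shards."""
    # Assumption for this sketch: the number of rows divides evenly
    assert len(rows) % num_devices == 0, "sketch assumes an even split"
    size = len(rows) // num_devices
    return [rows[i * size:(i + 1) * size] for i in range(num_devices)]

data = [list(range(5 * r, 5 * r + 5)) for r in range(4)]  # a 4 x 5 matrix
shards = split_evenly(data, 2)
print(shards[0])  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
print(shards[1])  # [[10, 11, 12, 13, 14], [15, 16, 17, 18, 19]]
```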


For later reuse we define a split_batch function that splits both data and labels.


#@save
def split_batch(X, y, devices):
    """Split `X` and `y` into multiple devices."""
    assert X.shape[0] == y.shape[0]
    return (nn.parallel.scatter(X, devices),
            nn.parallel.scatter(y, devices))


13.5.6 Training 


Now we can implement multi-GPU training on a single minibatch. Its implementation is
primarily based on the data parallelism approach described in this section. We will use the
auxiliary functions we just discussed, allreduce and split_batch, to synchronize the
data among multiple GPUs. Note that we do not need to write any specific code to achieve
parallelism. Since the computational graph does not have any dependencies across devices
within a minibatch, it is executed in parallel automatically.


def train_batch(X, y, device_params, devices, lr):
    X_shards, y_shards = split_batch(X, y, devices)
    # Loss is calculated separately on each GPU
    ls = [loss(lenet(X_shard, device_W), y_shard).sum()
          for X_shard, y_shard, device_W in zip(
              X_shards, y_shards, device_params)]
    for l in ls:  # Backpropagation is performed separately on each GPU
        l.backward()
    # Sum all gradients from each GPU and broadcast them to all GPUs
    with torch.no_grad():
        for i in range(len(device_params[0])):
            allreduce([device_params[c][i].grad for c in range(len(devices))])
    # The model parameters are updated separately on each GPU
    for param in device_params:
        d2l.sgd(param, lr, X.shape[0])  # Here, we use a full-size batch


Now, we can define the training function. It is slightly different from the ones used in the 
previous chapters: we need to allocate the GPUs and copy all the model parameters to all 
the devices. Obviously each batch is processed using the train_batch function to deal 
with multiple GPUs. For convenience (and conciseness of code) we compute the accuracy 
on a single GPU, though this is inefficient since the other GPUs are idle. 


def train(num_gpus, batch_size, lr):
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
    devices = [d2l.try_gpu(i) for i in range(num_gpus)]
    # Copy model parameters to `num_gpus` GPUs
    device_params = [get_params(params, d) for d in devices]
    num_epochs = 10
    animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
    timer = d2l.Timer()
    for epoch in range(num_epochs):
        timer.start()
        for X, y in train_iter:
            # Perform multi-GPU training for a single minibatch
            train_batch(X, y, device_params, devices, lr)
            torch.cuda.synchronize()
        timer.stop()
        # Evaluate the model on GPU 0
        animator.add(epoch + 1, (d2l.evaluate_accuracy_gpu(
            lambda x: lenet(x, device_params[0]), test_iter, devices[0]),))
    print(f'test acc: {animator.Y[0][-1]:.2f}, {timer.avg():.1f} sec/epoch '
          f'on {str(devices)}')


Let’s see how well this works on a single GPU. We first use a batch size of 256 and a 
learning rate of 0.2. 


train(num_gpus=1, batch_size=256, lr=0.2)


test acc: 0.83, 3.0 sec/epoch on [device(type='cuda', index=0)] 


By keeping the batch size and learning rate unchanged and increasing the number of GPUs
to 2, we can see that the test accuracy roughly stays the same compared with the previous
experiment. In terms of the optimization algorithms, they are identical. Unfortunately there
is no meaningful speedup to be gained here: the model is simply too small; moreover we
only have a small dataset, where our slightly unsophisticated approach to implementing
multi-GPU training suffers from significant Python overhead. We will encounter more
complex models and more sophisticated ways of parallelization going forward. Let’s see
what happens nonetheless for Fashion-MNIST.




train(num_gpus=2, batch_size=256, lr=0.2) 


test acc: 0.84, 2.8 sec/epoch on [device(type='cuda', index=0), device(type='cuda', index=1)]


13.5.7 Summary


• There are multiple ways to split deep network training over multiple GPUs. We could
split it between layers, across layers, or across data. The first two require tightly
choreographed data transfers. Data parallelism is the simplest strategy.


• Data parallel training is straightforward. However, to be efficient it requires a larger
effective minibatch size.


• In data parallelism, data is split across multiple GPUs, where each GPU executes its
own forward and backward operation and subsequently gradients are aggregated and
results are broadcast back to the GPUs.


• We may use slightly increased learning rates for larger minibatches.


13.5.8 Exercises 


1. When training on k GPUs, change the minibatch size from b to k - b, i.e., scale it up by 
the number of GPUs. 


2. Compare accuracy for different learning rates. How does it scale with the number of 
GPUs? 


3. Implement a more efficient allreduce function that aggregates different parameters on
different GPUs. Why is it more efficient?


4. Implement multi-GPU test accuracy computation. 


13.6 Concise Implementation for Multiple GPUs


Implementing parallelism from scratch for every new model is no fun. Moreover, there is
significant benefit in optimizing synchronization tools for high performance. In the following
we will show how to do this using high-level APIs of deep learning frameworks. The
mathematics and the algorithms are the same as in Section 13.5. Quite unsurprisingly you
will need at least two GPUs to run the code in this section.


import torch 
from torch import nn 
from d2l import torch as d2l


13.6.1 A Toy Network 


Let’s use a slightly more meaningful network than LeNet from Section 13.5 that is still 
sufficiently easy and quick to train. We pick a ResNet-18 variant (He et al., 2016). Since 
the input images are tiny we modify it slightly. In particular, the difference from Section 8.6 
is that we use a smaller convolution kernel, stride, and padding at the beginning. Moreover, 
we remove the max-pooling layer. 


#@save
def resnet18(num_classes, in_channels=1):
    """A slightly modified ResNet-18 model."""
    def resnet_block(in_channels, out_channels, num_residuals,
                     first_block=False):
        blk = []
        for i in range(num_residuals):
            if i == 0 and not first_block:
                blk.append(d2l.Residual(out_channels, use_1x1conv=True,
                                        strides=2))
            else:
                blk.append(d2l.Residual(out_channels))
        return nn.Sequential(*blk)

    # This model uses a smaller convolution kernel, stride, and padding and
    # removes the max-pooling layer
    net = nn.Sequential(
        nn.Conv2d(in_channels, 64, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(64),
        nn.ReLU())
    net.add_module("resnet_block1", resnet_block(64, 64, 2, first_block=True))
    net.add_module("resnet_block2", resnet_block(64, 128, 2))
    net.add_module("resnet_block3", resnet_block(128, 256, 2))
    net.add_module("resnet_block4", resnet_block(256, 512, 2))
    net.add_module("global_avg_pool", nn.AdaptiveAvgPool2d((1, 1)))
    net.add_module("fc", nn.Sequential(nn.Flatten(),
                                       nn.Linear(512, num_classes)))
    return net


13.6.2 Network Initialization


We will initialize the network inside the training loop. For a refresher on initialization 
methods see Section 5.4. 


net = resnet18(10)
# Get a list of GPUs
devices = d2l.try_all_gpus()
# We will initialize the network inside the training loop


13.6.3 Training 


As before, the training code needs to perform several basic functions for efficient paral- 
lelism: 


• Network parameters need to be initialized across all devices.

• While iterating over the dataset, minibatches are to be divided across all devices.

• We compute the loss and its gradient in parallel across devices.

• Gradients are aggregated and parameters are updated accordingly.


In the end we compute the accuracy (again in parallel) to report the final performance of 
the network. The training routine is quite similar to implementations in previous chapters, 
except that we need to split and aggregate data. 


def train(net, num_gpus, batch_size, lr):
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
    devices = [d2l.try_gpu(i) for i in range(num_gpus)]
    def init_weights(module):
        if type(module) in [nn.Linear, nn.Conv2d]:
            nn.init.normal_(module.weight, std=0.01)
    net.apply(init_weights)
    # Set the model on multiple GPUs
    net = nn.DataParallel(net, device_ids=devices)
    trainer = torch.optim.SGD(net.parameters(), lr)
    loss = nn.CrossEntropyLoss()
    timer, num_epochs = d2l.Timer(), 10
    animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
    for epoch in range(num_epochs):
        net.train()
        timer.start()
        for X, y in train_iter:
            trainer.zero_grad()
            X, y = X.to(devices[0]), y.to(devices[0])
            l = loss(net(X), y)
            l.backward()
            trainer.step()
        timer.stop()
        animator.add(epoch + 1, (d2l.evaluate_accuracy_gpu(net, test_iter),))
    print(f'test acc: {animator.Y[0][-1]:.2f}, {timer.avg():.1f} sec/epoch '
          f'on {str(devices)}')


Let’s see how this works in practice. As a warm-up we train the network on a single
GPU.


train(net, num_gpus=1, batch_size=256, lr=0.1) 


test acc: 0.91, 12.2 sec/epoch on [device(type='cuda', index=0)]


Next we use 2 GPUs for training. Compared with LeNet evaluated in Section 13.5, the
model for ResNet-18 is considerably more complex. This is where parallelization shows
its advantage. The time for computation is meaningfully larger than the time for
synchronizing parameters. This improves scalability since the overhead for parallelization
is less relevant.


train(net, num_gpus=2, batch_size=512, lr=0.2)


test acc: 0.73, 7.5 sec/epoch on [device(type='cuda', index=0), device(type='cuda', index=1)]


13.6.4 Summary


• Data is automatically evaluated on the devices where the data can be found.


• Take care to initialize the networks on each device before trying to access the parameters
on that device. Otherwise you will encounter an error.


• The optimization algorithms automatically aggregate over multiple GPUs.


13.6.5 Exercises


1. This section uses ResNet-18. Try different epochs, batch sizes, and learning rates. Use 
more GPUs for computation. What happens if you try this with 16 GPUs (e.g., on an 
AWS p2.16xlarge instance)? 


2. Sometimes, different devices provide different computing power. We could use the 
GPUs and the CPU at the same time. How should we divide the work? Is it worth the 
effort? Why? Why not? 




13.7 Parameter Servers 


As we move from a single GPU to multiple GPUs and then to multiple servers containing 
multiple GPUs, possibly all spread out across multiple racks and network switches, our 
algorithms for distributed and parallel training need to become much more sophisticated. 
Details matter since different interconnects have very different bandwidth (e.g., NVLink 
can offer up to 100 GB/s across 6 links in an appropriate setting, PCIe 4.0 (16-lane) offers 
32 GB/s, while even high speed 100GbE Ethernet only amounts to 10 GB/s). At the same 
time it is unreasonable to expect that a statistical modeler be an expert in networking and 
systems. 


The core idea of the parameter server was introduced in Smola and Narayanamurthy (2010) 
in the context of distributed latent variable models. A description of the push and pull 
semantics then followed in Ahmed et al. (2012) and a description of the system and an 
open source library followed in Li et al. (2014). In the following we will motivate the 
components needed for efficiency. 


13.7.1 Data-Parallel Training 


Let’s review the data parallel training approach to distributed training. We will use this 
to the exclusion of all others in this section since it is significantly simpler to implement 
in practice. There are virtually no use cases (besides deep learning on graphs) where any 
other strategy for parallelism is preferred since GPUs have plenty of memory nowadays. 
Fig. 13.7.1 describes the variant of data parallelism that we implemented in Section 13.5. 
The key aspect in it is that the aggregation of gradients occurs on one single GPU (GPU 0) 
before the updated parameters are rebroadcast to all GPUs. 


In retrospect, the decision to aggregate on GPU 0 seems rather ad hoc. After all, we might
just as well aggregate on the CPU. In fact, we could even decide to aggregate some of
the parameters on one GPU and some others on another. Provided that the optimization
algorithm supports this, there is no real reason why we could not. For instance, if we
have four parameter vectors with associated gradients $\mathbf{g}_1, \ldots, \mathbf{g}_4$
we could aggregate each $\mathbf{g}_i$ ($i = 1, \ldots, 4$) on a different GPU.


Fig. 13.7.1: Left: single GPU training. Right: a variant of multi-GPU training: (1) we compute loss
and gradient, (2) all gradients are aggregated on one GPU, (3) parameter update happens
and the parameters are re-distributed to all GPUs.


This reasoning seems arbitrary and frivolous. After all, the mathematics is the same throughout.
However, we are dealing with real physical hardware where different buses have different
bandwidth as discussed in Section 13.4. Consider a real 4-way GPU server as described
in Fig. 13.7.2. If it is particularly well connected, it might have a 100 GbE network card.
More typical numbers are in the 1-10 GbE range with an effective bandwidth of 100 MB/s
to 1 GB/s. Since the CPUs have too few PCIe lanes to connect to all GPUs directly (e.g.,
consumer-grade Intel CPUs have 24 lanes) we need a multiplexer. The bandwidth from
the CPU on a 16x Gen3 link is 16 GB/s. This is also the speed at which each of the GPUs
is connected to the switch. This means that it is more effective to communicate between
the devices.


Fig. 13.7.2: A 4-way GPU server.


For the sake of argument let’s assume that the gradients take up 160 MB. In this case
it takes 30 ms to send the gradients from all 3 remaining GPUs to the fourth one (each 
transfer takes 10 ms = 160 MB / 16 GB/s). Adding another 30 ms to transmit the weight 
vectors back we arrive at a total of 60 ms. If we send all data to the CPU we incur a penalty 
of 40 ms since each of the four GPUs needs to send the data to the CPU, yielding a total 
of 80 ms. Lastly assume that we are able to split the gradients into 4 parts of 40 MB each. 
Now we can aggregate each of the parts on a different GPU simultaneously since the PCIe 
switch offers a full-bandwidth operation between all links. Instead of 30 ms this takes 7.5 
ms, yielding a total of 15 ms for a synchronization operation. In short, depending on how 
we synchronize parameters the same operation can take anywhere from 15 ms to 80 ms. 
Fig. 13.7.3 depicts the different strategies for exchanging parameters. 
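

The arithmetic behind these three figures can be written down in a few lines. The sketch
below is a back-of-the-envelope model only: the function names are made up for this
illustration, transfers over a shared link are assumed to happen one after another, and
latency and protocol overhead are ignored.

```python
GRAD_MB = 160.0        # gradient size from the text
LINK_GB_PER_S = 16.0   # PCIe 3.0 16x bandwidth from the text

def transfer_ms(size_mb, bandwidth_gb_per_s=LINK_GB_PER_S):
    """Transfer time in ms; MB divided by GB/s gives ms when 1 GB = 1000 MB."""
    return size_mb / bandwidth_gb_per_s

def sync_on_one_gpu(num_gpus=4):
    # The other GPUs send in turn to the aggregating GPU, then receive in turn
    one_way = (num_gpus - 1) * transfer_ms(GRAD_MB)
    return 2 * one_way

def sync_on_cpu(num_gpus=4):
    # All GPUs send to the CPU in turn, then receive in turn
    return 2 * num_gpus * transfer_ms(GRAD_MB)

def sync_split_across_gpus(num_gpus=4):
    # Each GPU aggregates one of num_gpus parts; the parts move in parallel
    one_way = (num_gpus - 1) * transfer_ms(GRAD_MB / num_gpus)
    return 2 * one_way

print(sync_on_one_gpu(), sync_on_cpu(), sync_split_across_gpus())
# 60.0 80.0 15.0
```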


Fig. 13.7.3: Parameter synchronization strategies.


Note that we have yet another tool at our disposal when it comes to improving performance:
in a deep network it takes some time to compute all gradients from the top to the bottom.
We can begin synchronizing gradients for some parameter groups even while we are still
busy computing them for others. See e.g., Sergeev and Del Balso (2018) for details on how
to do this in Horovod.


13.7.2 Ring Synchronization 


When it comes to synchronization on modern deep learning hardware we often encounter
significantly bespoke network connectivity. For instance, the AWS p3.16xlarge and NVIDIA
DGX-2 instances share the connectivity structure of Fig. 13.7.4. Each GPU connects to a
host CPU via a PCIe link which operates at best at 16 GB/s. Additionally each GPU also
has 6 NVLink connections, each of which is capable of transferring 300 Gbit/s bidirectionally.
This amounts to around 18 GB/s per link per direction. In short, the aggregate
NVLink bandwidth is significantly higher than the PCIe bandwidth. The question is how
to use it most efficiently.


It turns out that the optimal synchronization strategy is to decompose the network into two 
rings and to use them to synchronize data directly (Wang et al., 2018). Fig. 13.7.5 illustrates 
that the network can be decomposed into one ring (1-2-3-4-5-6-7-8-1) with double NVLink 
bandwidth and into one (1-4-6-3-5-8-2-7-1) with regular bandwidth. Designing an efficient 
synchronization protocol in this case is nontrivial. 


Fig. 13.7.4: NVLink connectivity on 8 V100 GPU servers (image courtesy of NVIDIA).


Fig. 13.7.5: Decomposition of the NVLink network into two rings.


Consider the following thought experiment: given a ring of $n$ computing nodes (or GPUs)
we can send gradients from the first to the second node. There it is added to the local
gradient and sent on to the third node, and so on. After $n - 1$ steps the aggregate gradient
can be found in the last-visited node. That is, the time to aggregate gradients grows linearly
with the number of nodes. But if we do this the algorithm is quite inefficient. After all,
at any time there is only one of the nodes communicating. What if we broke the gradients
into $n$ chunks and started synchronizing chunk $i$ starting at node $i$? Since each chunk is of
size $1/n$ the total time is now $(n - 1)/n \approx 1$. In other words, the time spent to aggregate
gradients does not grow as we increase the size of the ring. This is quite an astonishing
result. Fig. 13.7.6 illustrates the sequence of steps on $n = 4$ nodes.


Fig. 13.7.6: Ring synchronization across 4 nodes. Each node starts transmitting parts of gradients to
its left neighbor until the assembled gradient can be found in its right neighbor.
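

The chunked ring exchange can be simulated without any GPUs. The sketch below is one
possible rendering of the idea under stated assumptions: nodes are plain Python lists,
every node sends to node (i + 1) mod n (the direction is only a convention), and the
function name ring_allreduce is chosen for this illustration. The first phase aggregates
each chunk as it travels around the ring; the second phase passes the finished chunks
around once more so that every node ends up with the full sum.

```python
def ring_allreduce(node_chunks):
    """Sum chunked vectors across nodes arranged in a ring.

    node_chunks[i][c] is chunk c (a list of floats) held by node i.
    The data is updated in place; the number of steps is returned.
    """
    n = len(node_chunks)
    steps = 0
    # Phase 1: chunk c starts at node c and travels around the ring,
    # picking up each node's local contribution on the way.
    for s in range(n - 1):
        msgs = [(i, (i - s) % n, list(node_chunks[i][(i - s) % n]))
                for i in range(n)]  # all nodes send at the same time
        for src, c, payload in msgs:
            dst = (src + 1) % n
            node_chunks[dst][c] = [a + b for a, b in
                                   zip(node_chunks[dst][c], payload)]
        steps += 1
    # Phase 2: the fully aggregated chunks are passed around once more
    for s in range(n - 1):
        msgs = [(i, (i + 1 - s) % n, list(node_chunks[i][(i + 1 - s) % n]))
                for i in range(n)]
        for src, c, payload in msgs:
            node_chunks[(src + 1) % n][c] = payload
        steps += 1
    return steps

n = 4
# Node i holds the value (i + 1) * (c + 1) in chunk c
nodes = [[[float((i + 1) * (c + 1))] for c in range(n)] for i in range(n)]
steps = ring_allreduce(nodes)
print(steps)     # 6, i.e., 2 * (n - 1)
print(nodes[0])  # [[10.0], [20.0], [30.0], [40.0]]
```

Each step moves only a chunk of size 1/n per link, so the n - 1 aggregation steps together
cost (n - 1)/n of a full transfer, in line with the estimate in the text.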


If we use the same example of synchronizing 160 MB across 8 V100 GPUs we arrive
at approximately $2 \cdot 160\,\mathrm{MB} / (3 \cdot 18\,\mathrm{GB/s}) \approx 6\,\mathrm{ms}$.
This is better than using the PCIe bus, even though we are now using 8 GPUs. Note that in
practice these numbers are a bit worse, since deep learning frameworks often fail to
assemble communication into large burst transfers.
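

For completeness, the estimate can be reproduced as follows. The variable names are
illustrative, and reading the factor 2 as one pass to aggregate plus one pass to
redistribute is an assumption of this sketch rather than something the text spells out.

```python
grad_mb = 160.0        # gradient size from the text
link_gb_per_s = 18.0   # per-link, per-direction NVLink bandwidth from the text
num_links = 3          # number of links counted in the text's estimate

# Assumed reading of the factor 2: aggregate once, then redistribute once.
# MB divided by GB/s gives milliseconds when 1 GB = 1000 MB.
ring_ms = 2 * grad_mb / (num_links * link_gb_per_s)
print(f"{ring_ms:.1f} ms")  # 5.9 ms
```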


Note that there is a common misconception that ring synchronization is fundamentally 
different from other synchronization algorithms. The only difference is that the synchro- 
nization path is somewhat more elaborate when compared with a simple tree. 


13.7.3 Multi-Machine Training 


Distributed training on multiple machines adds a further challenge: we need to communi- 
cate with servers that are only connected across a comparatively lower bandwidth fabric that 
can be over an order of magnitude slower in some cases. Synchronization across devices is 
tricky. After all, different machines running training code will have subtly different speed. 
Hence we need to synchronize them if we want to use synchronous distributed optimization. 
Fig. 13.7.7 illustrates how distributed parallel training occurs. 


1. A (different) batch of data is read on each machine, split across multiple GPUs and 
transferred to GPU memory. There predictions and gradients are computed on each 
GPU batch separately. 


2. The gradients from all local GPUs are aggregated on one GPU (or parts of it are aggre- 
gated over different GPUs). 


3. The gradients are sent to the CPUs.


4. The CPUs send the gradients to a central parameter server which aggregates all the 
gradients. 


5. The aggregate gradients are then used to update the parameters and the updated param- 
eters are broadcast back to the individual CPUs. 


6. The information is sent to one (or multiple) GPUs. 


7. The updated parameters are spread across all GPUs. 


Multi-machine multi-GPU distributed parallel training. 


Each of these operations seems rather straightforward. And, indeed, they can be carried 
out efficiently within a single machine. Once we look at multiple machines, though, we 
can see that the central parameter server becomes the bottleneck. After all, the bandwidth 
per server is limited, hence for m workers the time it takes to send all gradients to the 
server is O(m). We can break through this barrier by increasing the number of servers 
to n. At this point each server only needs to store O(1/n) of the parameters, hence the 
total time for updates and optimization becomes O(m/n). Matching both numbers yields 
constant scaling regardless of how many workers we are dealing with. In practice we use the 
same machines both as workers and as servers. Fig. 13.7.8 illustrates the design (see also 
(Li et al., 2014) for details). In particular, ensuring that multiple machines work without 
unreasonable delays is nontrivial. 
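
The scaling argument can be illustrated with a few lines of arithmetic. The model below is deliberately idealized: it assumes that server bandwidth is the only bottleneck and that parameters are split evenly across servers. The function name `sync_time` and all numbers are illustrative assumptions.

```python
def sync_time(num_workers, num_servers, gradient_bytes, server_bandwidth):
    """Idealized time for all workers to send gradients to the servers.

    Hypothetical model: each server stores an equal 1/num_servers share of
    the parameters, and its incoming bandwidth is the only limit.
    """
    bytes_per_server = num_workers * gradient_bytes / num_servers
    return bytes_per_server / server_bandwidth


gradient_bytes, bandwidth = 160e6, 1e9  # illustrative numbers only
for m in (1, 4, 16):
    single = sync_time(m, 1, gradient_bytes, bandwidth)   # O(m)
    matched = sync_time(m, m, gradient_bytes, bandwidth)  # O(m/n) with n = m
    print(f'm={m:2d}: one server {single:.2f} s, m servers {matched:.2f} s')
```

With a single server the time grows linearly in the number of workers, whereas matching the number of servers to the number of workers keeps it constant.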


13.7.4 Key-Value Stores 


Implementing the steps required for distributed multi-GPU training in practice is nontrivial. 
This is why it pays to use a common abstraction, namely that of a key-value store with 
redefined update semantics. 


Top: a single parameter server is a bottleneck since its bandwidth is finite. Bottom: 
multiple parameter servers store parts of the parameters with aggregate bandwidth. 


Across many workers and many GPUs the computation for gradient i can be defined as 


$$\mathbf{g}_i = \sum_{k \in \textrm{workers}} \sum_{j \in \textrm{GPUs}} \mathbf{g}_{ijk}, \tag{13.7.1}$$

where $\mathbf{g}_{ijk}$ is part of gradient i split on GPU j of worker k. The key aspect in this operation 
is that it is a commutative reduction, that is, it turns many vectors into one and the order in 
which the operation is applied does not matter. This is great for our purposes since we do 
not (need to) have fine-grained control over when which gradient is received. Besides, note 
that this operation is independent among different i. 


This allows us to define the following two operations: push, which accumulates gradients, 
and pull, which retrieves aggregate gradients. Since we have many different sets of 
gradients (after all, we have many layers), we need to index the gradients with a key i. This 
similarity to key-value stores, such as the one introduced in Dynamo (DeCandia et al., 
2007), is no coincidence. They, too, share many of these characteristics, in particular 
when it comes to distributing the parameters across multiple servers. 


The push and pull operations for key-value stores are described as follows: 


• push(key, value) sends a particular gradient (the value) from a worker to a common 
storage. There the value is aggregated, e.g., by summing it up. 


• pull(key, value) retrieves an aggregate value from common storage, e.g., after combining 
the gradients from all workers. 


By hiding all the complexity about synchronization behind a simple push and pull operation 
we can decouple the concerns of statistical modelers who want to be able to express 
optimization in simple terms and the system engineers who need to deal with the complexity 
inherent in distributed synchronization. 
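
The push and pull semantics can be sketched with a tiny in-memory store. This is a toy illustration only: the class name `ToyKVStore` is hypothetical, everything runs in one process, and `pull` simply returns the aggregate instead of writing into a supplied value.

```python
import random


class ToyKVStore:
    """Hypothetical in-memory key-value store with summing push semantics."""

    def __init__(self):
        self._store = {}

    def push(self, key, value):
        # Aggregate by elementwise summation. Because the sum is a
        # commutative reduction, the order of pushes does not matter.
        if key not in self._store:
            self._store[key] = list(value)
        else:
            self._store[key] = [a + b
                                for a, b in zip(self._store[key], value)]

    def pull(self, key):
        # Return a copy of the aggregate stored under this key.
        return list(self._store[key])


# Gradient pieces for key 0 (say, one layer) from 2 workers with 2 GPUs each.
contributions = [[1.0, 2.0], [0.5, 0.5], [3.0, -1.0], [0.5, 0.5]]

store_a, store_b = ToyKVStore(), ToyKVStore()
for g in contributions:
    store_a.push(0, g)

shuffled = contributions[:]
random.Random(0).shuffle(shuffled)  # a different arrival order
for g in shuffled:
    store_b.push(0, g)

print(store_a.pull(0), store_b.pull(0))
```

Both stores hold the same aggregate regardless of arrival order, which is why the workers do not need fine-grained control over when each gradient is received.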


13.7.5 Summary 


• Synchronization needs to be highly adaptive to specific network infrastructure and 
connectivity within a server. This can make a significant difference to the time it takes to 
synchronize. 


• Ring synchronization can be optimal for p3 and DGX-2 servers. For others possibly not 
so much. 


• A hierarchical synchronization strategy works well when adding multiple parameter 
servers for increased bandwidth. 


13.7.6 Exercises 


1. Can you speed up ring synchronization even further? Hint: you can send messages 
in both directions. 


2. Is it possible to allow asynchronous communication (while computation is still ongo- 
ing)? How does it affect performance? 


3. What if we lost a server during a long-running computation? How can we design a fault 
tolerance mechanism to avoid restarting the computation fully? 


Discussions. 


Whether it is medical diagnosis, self-driving vehicles, camera monitoring, or smart filters, 
many applications in the field of computer vision are closely related to our current and fu- 
ture lives. In recent years, deep learning has been the transformative power for advancing 
the performance of computer vision systems. It can be said that the most advanced com- 
puter vision applications are almost inseparable from deep learning. In view of this, this 
chapter will focus on the field of computer vision, and investigate methods and applications 
that have recently been influential in academia and industry. 


In Chapter 7 and Chapter 8, we studied various convolutional neural networks that are 
commonly used in computer vision, and applied them to simple image classification tasks. 
At the beginning of this chapter, we will describe two methods that may improve model 
generalization, namely image augmentation and fine-tuning, and apply them to image clas- 
sification. Since deep neural networks can effectively represent images in multiple lev- 
els, such layerwise representations have been successfully used in various computer vision 
tasks such as object detection, semantic segmentation, and style transfer. Following the key 
idea of leveraging layerwise representations in computer vision, we will begin with major 
components and techniques for object detection. Next, we will show how to use fully con- 
volutional networks for semantic segmentation of images. Then we will explain how to use 
style transfer techniques to generate images like the cover of this book. In the end, we con- 
clude this chapter by applying the materials of this chapter and several previous chapters 
on two popular computer vision benchmark datasets. 


14.1 Image Augmentation 


In Section 8.1, we mentioned that large datasets are a prerequisite for the success of deep 
neural networks in various applications. Image augmentation generates similar but distinct 
training examples after a series of random changes to the training images, thereby expand- 
ing the size of the training set. Alternatively, image augmentation can be motivated by the 
fact that random tweaks of training examples allow models to rely less on certain attributes, 
thereby improving their generalization ability. For example, we can crop an image in dif- 
ferent ways to make the object of interest appear in different positions, thereby reducing 
the dependence of a model on the position of the object. We can also adjust factors such as 
brightness and color to reduce a model’s sensitivity to color. It is probably true that image 
augmentation was indispensable for the success of AlexNet at that time. In this section we 
will discuss this widely used technique in computer vision. 


%matplotlib inline
import torch
import torchvision
from torch import nn
from d2l import torch as d2l


14.1.1 Common Image Augmentation Methods 


In our investigation of common image augmentation methods, we will use the following 
400 × 500 image as an example. 


d2l.set_figsize()
img = d2l.Image.open('../img/cat1.jpg')
d2l.plt.imshow(img);


Most image augmentation methods have a certain degree of randomness. To make it easier 
for us to observe the effect of image augmentation, next we define an auxiliary function 
apply. This function runs the image augmentation method aug multiple times on the input 
image img and shows all the results. 


def apply(img, aug, num_rows=2, num_cols=4, scale=1.5):
    Y = [aug(img) for _ in range(num_rows * num_cols)]
    d2l.show_images(Y, num_rows, num_cols, scale=scale)


Flipping and Cropping 


Flipping the image left and right usually does not change the category of the object. This is 
one of the earliest and most widely used methods of image augmentation. Next, we use the 
transforms module to create the RandomHorizontalFlip instance, which flips an image 
left and right with a 50% chance. 


apply(img, torchvision.transforms.RandomHorizontalFlip())


Flipping up and down is not as common as flipping left and right. But at least for this 
example image, flipping up and down does not hinder recognition. Next, we create a 
RandomVerticalFlip instance to flip an image up and down with a 50% chance. 


apply(img, torchvision.transforms.RandomVerticalFlip()) 
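
What the two flips do to the pixel grid can be shown without any framework. The sketch below works on a tiny image stored as a list of rows; the function names and the `p` and `rng` parameters are illustrative assumptions, not part of torchvision.

```python
import random


def random_horizontal_flip(image, p=0.5, rng=random):
    """Flip a 2D image (list of rows) left-right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return [row[:] for row in image]


def random_vertical_flip(image, p=0.5, rng=random):
    """Flip a 2D image (list of rows) up-down with probability p."""
    if rng.random() < p:
        return [row[:] for row in image[::-1]]
    return [row[:] for row in image]


image = [[1, 2, 3],
         [4, 5, 6]]
# p=1.0 forces the flip so that the effect is visible.
print(random_horizontal_flip(image, p=1.0))
print(random_vertical_flip(image, p=1.0))
```

A left-right flip reverses every row, while an up-down flip reverses the order of the rows.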


In the example image we used, the cat is in the middle of the image, but this may not be the 
case in general. In Section 7.5, we explained that the pooling layer can reduce the sensitivity 
of a convolutional layer to the target position. In addition, we can also randomly crop the 
image to make objects appear in different positions in the image at different scales, which 
can also reduce the sensitivity of a model to the target position. 


In the code below, we randomly crop a region whose area is 10% to 100% of the original 
area each time, and the width-to-height ratio of this region is randomly selected from 0.5 to 
2. Then, the width and height of the region are both scaled to 200 pixels. Unless otherwise 
specified, the random number between a and b in this section refers to a continuous value 
obtained by random and uniform sampling from the interval [a, b]. 


shape_aug = torchvision.transforms.RandomResizedCrop(
    (200, 200), scale=(0.1, 1), ratio=(0.5, 2))
apply(img, shape_aug) 
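
The sampling of the crop region described above can be sketched as follows. This follows the description in the text (uniform sampling of the area fraction and of the aspect ratio); the library's own implementation may differ in details. The function name `sample_crop_box`, the retry limit, and reading the 400 × 500 image as height 400 and width 500 are assumptions. The final resize to 200 × 200 is omitted.

```python
import math
import random


def sample_crop_box(height, width, scale=(0.1, 1.0), ratio=(0.5, 2.0),
                    rng=random, max_tries=10):
    """Sample (top, left, crop_h, crop_w) for a random crop (sketch)."""
    for _ in range(max_tries):
        area = rng.uniform(*scale) * height * width  # target crop area
        aspect = rng.uniform(*ratio)                 # width / height
        crop_w = int(round(math.sqrt(area * aspect)))
        crop_h = int(round(math.sqrt(area / aspect)))
        # Accept the sample only if the region fits inside the image.
        if 0 < crop_w <= width and 0 < crop_h <= height:
            top = rng.randint(0, height - crop_h)
            left = rng.randint(0, width - crop_w)
            return top, left, crop_h, crop_w
    return 0, 0, height, width  # fallback: use the whole image


rng = random.Random(0)
top, left, crop_h, crop_w = sample_crop_box(400, 500, rng=rng)
print(top, left, crop_h, crop_w)
```

Since both the size and the position of the region vary, the object of interest appears at different positions and scales across samples.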


Changing Colors 


Another augmentation method is changing colors. We can change four aspects of the image 
color: brightness, contrast, saturation, and hue. In the example below, we randomly change 
the brightness of the image to a value between 50% (1 − 0.5) and 150% (1 + 0.5) of the 
original image. 


apply(img, torchvision.transforms.ColorJitter(
    brightness=0.5, contrast=0, saturation=0, hue=0))


Similarly, we can randomly change the hue of the image. 


apply(img, torchvision.transforms.ColorJitter( 
brightness=0, contrast=0, saturation=0, hue=0.5)) 
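
To see what the brightness setting described above means for pixel values, consider this plain-Python sketch on a small grayscale image. It is a simplified illustration, not torchvision's implementation; the function name `random_brightness` and the 8-bit clamping are assumptions.

```python
import random


def random_brightness(image, brightness=0.5, rng=random):
    """Scale all pixels by a factor drawn from [1 - b, 1 + b] (sketch)."""
    factor = rng.uniform(1 - brightness, 1 + brightness)
    # Clamp to the valid 8-bit range after scaling.
    scaled = [[min(255, max(0, round(v * factor))) for v in row]
              for row in image]
    return scaled, factor


image = [[0, 100], [200, 255]]
jittered, factor = random_brightness(image, rng=random.Random(0))
print(factor, jittered)
```

With brightness=0.5 the factor always lies between 0.5 and 1.5, which corresponds to the 50% to 150% range mentioned in the text.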


We can also create a ColorJitter instance and set how to randomly change the 
brightness, contrast, saturation, and hue of the image at the same time. 


color_aug = torchvision.transforms.ColorJitter(
    brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5)
apply(img, color_aug)


Combining Multiple Image Augmentation Methods 


In practice, we will combine multiple image augmentation methods. For example, we can 
combine the different image augmentation methods defined above and apply them to each 
image via a Compose instance. 


augs = torchvision.transforms.Compose([
    torchvision.transforms.RandomHorizontalFlip(), color_aug, shape_aug])
apply(img, augs) 
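
Conceptually, composing augmentations just means applying them one after another. The following sketch shows the idea with a hypothetical `ComposeSketch` class and two deterministic toy transforms; it is not the library's implementation.

```python
class ComposeSketch:
    """Apply a list of callables one after another (hypothetical helper)."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, x):
        for transform in self.transforms:
            x = transform(x)
        return x


def flip(image):
    """Deterministic left-right flip of a list-of-rows image."""
    return [row[::-1] for row in image]


def brighten(image):
    """Add 10 to every pixel and clamp at 255."""
    return [[min(255, v + 10) for v in row] for row in image]


pipeline = ComposeSketch([flip, brighten])
print(pipeline([[1, 2], [3, 250]]))
```

The order in the list is the order of application, so the output of each transform becomes the input of the next.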


14.1.2 Training with Image Augmentation 


Let’s train a model with image augmentation. Here we use the CIFAR-10 dataset instead 
of the Fashion-MNIST dataset that we used before. This is because the position and size 
of the objects in the Fashion-MNIST dataset have been normalized, while the color and 
size of the objects in the CIFAR-10 dataset have more significant differences. The first 32 
training images in the CIFAR-10 dataset are shown below. 


all_images = torchvision.datasets.CIFAR10(train=True, root="../data",
                                          download=True)
d2l.show_images([all_images[i][0] for i in range(32)], 4, 8, scale=0.8);


Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ../data/cifar-10-python.tar.gz
100% 170498071/170498071 [00:04<00:00, 37716809.52it/s]
Extracting ../data/cifar-10-python.tar.gz to ../data


In order to obtain definitive results during prediction, we usually only apply image 
augmentation to training examples, and do not use image augmentation with random 
operations during prediction. Here we only use the simplest random left-right flipping method. 
In addition, we use a ToTensor instance to convert each image into the format 
required by the deep learning framework, i.e., 32-bit floating point numbers between 0 and 
1 with the shape (number of channels, height, width). The data loader then stacks these 
into minibatches of shape (batch size, number of channels, height, width). 
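
The format conversion can be illustrated on a tiny image in plain Python. The sketch below mimics the idea for a single image stored as nested lists (height × width × channels with integer values from 0 to 255); the function name `to_chw_float` is hypothetical, and real tensors are of course not nested lists.

```python
def to_chw_float(image_hwc):
    """Convert an H x W x C image with 0..255 integers into
    C x H x W floats in [0, 1] (sketch on nested lists)."""
    height, width = len(image_hwc), len(image_hwc[0])
    channels = len(image_hwc[0][0])
    return [[[image_hwc[i][j][c] / 255 for j in range(width)]
             for i in range(height)]
            for c in range(channels)]


# A 1 x 2 RGB image: one red pixel and one white pixel.
image = [[[255, 0, 0], [255, 255, 255]]]
chw = to_chw_float(image)
print(chw)
```

The channel axis moves to the front and the values are rescaled to the interval from 0 to 1.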


train_augs = torchvision.transforms.Compose([
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor()])

test_augs = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor()])


Next, we define an auxiliary function to facilitate reading the image and applying image 
augmentation. The transform argument provided by PyTorch’s dataset applies augmen- 
tation to transform the images. For a detailed introduction to DataLoader, please refer to 
Section 4.2. 


def load_cifar10(is_train, augs, batch_size):
    dataset = torchvision.datasets.CIFAR10(root="../data", train=is_train,
                                           transform=augs, download=True)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                    shuffle=is_train, num_workers=d2l.get_dataloader_workers())
    return dataloader


Multi-GPU Training 


We train the ResNet-18 model from Section 8.6 on the CIFAR-10 dataset. Recall the in- 
troduction to multi-GPU training in Section 13.6. In the following, we define a function to 
train and evaluate the model using multiple GPUs. 


#@save
def train_batch_ch13(net, X, y, loss, trainer, devices):
    """Train for a minibatch with multiple GPUs (defined in Chapter 13)."""
    if isinstance(X, list):
        # Required for BERT fine-tuning (to be covered later)
        X = [x.to(devices[0]) for x in X]
    else:
        X = X.to(devices[0])
    y = y.to(devices[0])
    net.train()
    trainer.zero_grad()
    pred = net(X)
    l = loss(pred, y)
    l.sum().backward()
    trainer.step()
    train_loss_sum = l.sum()
    train_acc_sum = d2l.accuracy(pred, y)
    return train_loss_sum, train_acc_sum


#@save
def train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs,
               devices=d2l.try_all_gpus()):
    """Train a model with multiple GPUs (defined in Chapter 13)."""
    timer, num_batches = d2l.Timer(), len(train_iter)
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0, 1],
                            legend=['train loss', 'train acc', 'test acc'])
    net = nn.DataParallel(net, device_ids=devices).to(devices[0])
    for epoch in range(num_epochs):
        # Sum of training loss, sum of training accuracy, no. of examples,
        # no. of predictions
        metric = d2l.Accumulator(4)
        for i, (features, labels) in enumerate(train_iter):
            timer.start()
            l, acc = train_batch_ch13(
                net, features, labels, loss, trainer, devices)
            metric.add(l, acc, labels.shape[0], labels.numel())
            timer.stop()
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (metric[0] / metric[2], metric[1] / metric[3],
                              None))
        test_acc = d2l.evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {metric[0] / metric[2]:.3f}, train acc '
          f'{metric[1] / metric[3]:.3f}, test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec on '
          f'{str(devices)}')


Now we can define the train_with_data_aug function to train the model with image 
augmentation. This function gets all available GPUs, uses Adam as the optimization algo- 
rithm, applies image augmentation to the training dataset, and finally calls the train_ch13 
function just defined to train and evaluate the model. 


batch_size, devices, net = 256, d2l.try_all_gpus(), d2l.resnet18(10, 3)
net.apply(d2l.init_cnn)

def train_with_data_aug(train_augs, test_augs, net, lr=0.001):
    train_iter = load_cifar10(True, train_augs, batch_size)
    test_iter = load_cifar10(False, test_augs, batch_size)
    loss = nn.CrossEntropyLoss(reduction="none")
    trainer = torch.optim.Adam(net.parameters(), lr=lr)
    net(next(iter(train_iter))[0])
    train_ch13(net, train_iter, test_iter, loss, trainer, 10, devices)


Let’s train the model using image augmentation based on random left-right flipping. 


train_with_data_aug(train_augs, test_augs, net) 


loss 0.215, train acc 0.925, test acc 0.810
4728.8 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]


14.1.3 Summary 


• Image augmentation generates random images based on existing training data to improve 
the generalization ability of models. 


• In order to obtain definitive results during prediction, we usually only apply image 
augmentation to training examples, and do not use image augmentation with random 
operations during prediction. 


• Deep learning frameworks provide many different image augmentation methods, which 
can be applied simultaneously. 


14.1.4 Exercises 


1. Train the model without using image augmentation: train_with_data_aug(test_augs, 
test_augs). Compare training and testing accuracy when using and not using image 
augmentation. Can this comparative experiment support the argument that image 
augmentation can mitigate overfitting? Why? 


2. Combine multiple different image augmentation methods in model training on the 
CIFAR-10 dataset. Does it improve test accuracy? 


3. Refer to the online documentation of the deep learning framework. What other image 
augmentation methods does it provide? 


Discussions. 


14.2 Fine-Tuning 


In earlier chapters, we discussed how to train models on the Fashion-MNIST training 
dataset with only 60000 images. We also described ImageNet, the most widely used large- 
scale image dataset in academia, which has more than 10 million images and 1000 objects. 
However, the size of the dataset that we usually encounter is between those of the two 
datasets. 


Suppose that we want to recognize different types of chairs from images, and then recom- 
mend purchase links to users. One possible method is to first identify 100 common chairs, 
take 1000 images of different angles for each chair, and then train a classification model 
on the collected image dataset. Although this chair dataset may be larger than the Fashion- 
MNIST dataset, the number of examples is still less than one-tenth of that in ImageNet. 
This may lead to overfitting of complicated models that are suitable for ImageNet on this 
chair dataset. Besides, due to the limited amount of training examples, the accuracy of the 
trained model may not meet practical requirements. 


In order to address the above problems, an obvious solution is to collect more data. How- 
ever, collecting and labeling data can take a lot of time and money. For example, in order 
to collect the ImageNet dataset, researchers have spent millions of dollars from research 
funding. Although the current data collection cost has been significantly reduced, this cost 
still cannot be ignored. 


Another solution is to apply transfer learning to transfer the knowledge learned from the 
source dataset to the target dataset. For example, although most of the images in the Ima- 
geNet dataset have nothing to do with chairs, the model trained on this dataset may extract 
more general image features, which can help identify edges, textures, shapes, and object 
composition. These similar features may also be effective for recognizing chairs. 


14.2.1 Steps 


In this section, we will introduce a common technique in transfer learning: fine-tuning. As 
shown in Fig. 14.2.1, fine-tuning consists of the following four steps: 


1. Pretrain a neural network model, i.e., the source model, on a source dataset (e.g., the 
ImageNet dataset). 


2. Create a new neural network model, i.e., the target model. This copies all model designs 
and their parameters from the source model except the output layer. We assume that 
these model parameters contain the knowledge learned from the source dataset and this 
knowledge will also be applicable to the target dataset. We also assume that the output 
layer of the source model is closely related to the labels of the source dataset; thus it is 
not used in the target model. 


3. Add an output layer to the target model, whose number of outputs is the number of 
categories in the target dataset. Then randomly initialize the model parameters of this 
layer. 


4. Train the target model on the target dataset, such as a chair dataset. The output layer 
will be trained from scratch, while the parameters of all the other layers are fine-tuned 
based on the parameters of the source model. 


Fine tuning. (Figure: the source model is pretrained on the source dataset; its layers up 
to layer L-1 are copied into the target model and fine-tuned on the target dataset, while the 
output layer of the target model is randomly initialized and trained from scratch.) 


When target datasets are much smaller than source datasets, fine-tuning helps to improve 
models’ generalization ability. 
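
The copy-and-replace logic of steps 2 and 3 can be sketched with plain dictionaries standing in for model parameters. Everything here is illustrative: the layer names, the helper `make_target_params`, and the uniform initialization range are assumptions, and no actual training takes place.

```python
import random


def make_target_params(source_params, output_name, num_inputs, num_classes,
                       rng=random):
    """Copy every layer except the output layer, then add a freshly
    initialized output layer for the target classes (sketch)."""
    # Step 2: copy all parameters except those of the output layer.
    target = {name: list(values) for name, values in source_params.items()
              if name != output_name}
    # Step 3: randomly initialize a new output layer of the right size.
    target[output_name] = [rng.uniform(-0.1, 0.1)
                           for _ in range(num_inputs * num_classes)]
    return target


# Toy "pretrained" source model with 3 output classes and 2 input features
# to the output layer.
source = {'layer1': [0.3, -0.2], 'layer2': [0.7], 'output': [0.1] * 6}
target = make_target_params(source, 'output', num_inputs=2, num_classes=2,
                            rng=random.Random(0))
print(target['layer1'] == source['layer1'], len(target['output']))
```

Only the output layer changes shape; all other parameters start from their pretrained values and are subsequently fine-tuned in step 4.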


14.2.2 Hot Dog Recognition 


Let’s demonstrate fine-tuning via a concrete case: hot dog recognition. We will fine-tune 
a ResNet model, which was pretrained on the ImageNet dataset, on a small dataset. This 
small dataset consists of thousands of images with and without hot dogs. We will use the 
fine-tuned model to recognize hot dogs from images. 


%matplotlib inline
import os
import torch
import torchvision
from torch import nn
from d2l import torch as d2l


Reading the Dataset 


The hot dog dataset we use was taken from online images. This dataset consists of 1400 
positive-class images containing hot dogs, and as many negative-class images containing 
other foods. 1000 images of each class are used for training and the rest are for testing. 


After unzipping the downloaded dataset, we obtain two folders hotdog/train and 
hotdog/test. Both folders have hotdog and not-hotdog subfolders, each of which contains 
images of the corresponding class. 


#@save
d2l.DATA_HUB['hotdog'] = (d2l.DATA_URL + 'hotdog.zip',
                         'fba480ffa8aa7e0febbb511d181409f899b9baa5')

data_dir = d2l.download_extract('hotdog')


Downloading ../data/hotdog.zip from http://d2l-data.s3-accelerate.amazonaws.com/hotdog.zip...


We create two instances to read all the image files in the training and testing datasets, re- 
spectively. 


train_imgs = torchvision.datasets.ImageFolder(os.path.join(data_dir, 'train'))
test_imgs = torchvision.datasets.ImageFolder(os.path.join(data_dir, 'test'))


The first 8 positive examples and the last 8 negative images are shown below. As you can 
see, the images vary in size and aspect ratio. 


hotdogs = [train_imgs[i][0] for i in range(8)]
not_hotdogs = [train_imgs[-i - 1][0] for i in range(8)]
d2l.show_images(hotdogs + not_hotdogs, 2, 8, scale=1.4);


During training, we first crop a random area of random size and random aspect ratio from 
the image, and then scale this area to a 224 x 224 input image. During testing, we scale both 
the height and width of an image to 256 pixels, and then crop a central 224 x 224 area as 
input. In addition, for the three RGB (red, green, and blue) color channels we standardize 
their values channel by channel. Concretely, the mean value of a channel is subtracted from 
each value of that channel and then the result is divided by the standard deviation of that 
channel. 


# Specify the means and standard deviations of the three RGB channels to
# standardize each channel
normalize = torchvision.transforms.Normalize(
    [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

train_augs = torchvision.transforms.Compose([
    torchvision.transforms.RandomResizedCrop(224),
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor(),
    normalize])

test_augs = torchvision.transforms.Compose([
    torchvision.transforms.Resize([256, 256]),
    torchvision.transforms.CenterCrop(224),
    torchvision.transforms.ToTensor(),
    normalize])
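
The channel-wise standardization is just (value − mean) / standard deviation applied per channel. As a quick sanity check, the plain-Python sketch below applies it to single RGB pixels using the means and standard deviations listed above; the helper name `normalize_pixel` is hypothetical.

```python
def normalize_pixel(rgb, mean, std):
    """Standardize one RGB pixel channel by channel: (value - mean) / std."""
    return [(v - m) / s for v, m, s in zip(rgb, mean, std)]


mean, std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]
at_mean = normalize_pixel(mean, mean, std)          # pixel equal to the mean
white = normalize_pixel([1.0, 1.0, 1.0], mean, std)  # white pixel
print(at_mean, white)
```

A pixel that equals the channel means maps to zero in every channel, and a white pixel maps to roughly two to three standard deviations above the mean.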


Defining and Initializing the Model 


We use ResNet-18, which was pretrained on the ImageNet dataset, as the source model. 
Here, we specify pretrained=True to automatically download the pretrained model pa- 
rameters. If this model is used for the first time, Internet connection is required for down- 
load. 


pretrained_net = torchvision.models.resnet18(pretrained=True) 


The pretrained source model instance contains a number of feature layers and an output 
layer fc. The main purpose of this division is to facilitate the fine-tuning of model 
parameters of all layers but the output layer. The member variable fc of the source model is given 
below. 


pretrained_net.fc 


Linear(in_features=512, out_features=1000, bias=True) 


As a fully connected layer, it transforms ResNet’s final global average pooling outputs into 
1000 class outputs of the ImageNet dataset. We then construct a new neural network as 
the target model. It is defined in the same way as the pretrained source model except that 
its number of outputs in the final layer is set to the number of classes in the target dataset 
(rather than 1000). 


In the code below, the model parameters before the output layer of the target model 
instance finetune_net are initialized to the model parameters of the corresponding layers from 
the source model. Since these model parameters were obtained via pretraining on 
ImageNet, they are already effective. Therefore, we only need a small learning rate to fine-tune such 
pretrained parameters. In contrast, model parameters in the output layer are randomly 
initialized and generally require a larger learning rate to be learned from scratch. Letting the 
base learning rate be η, a learning rate of 10η will be used to iterate the model parameters 
in the output layer. 


finetune_net = torchvision.models.resnet18(pretrained=True)
finetune_net.fc = nn.Linear(finetune_net.fc.in_features, 2)
nn.init.xavier_uniform_(finetune_net.fc.weight);


Fine-Tuning the Model 


First, we define a training function train_fine_tuning that uses fine-tuning so it can be 
called multiple times. 


# If `param_group=True`, the model parameters in the output layer will be
# updated using a learning rate ten times greater
def train_fine_tuning(net, learning_rate, batch_size=128, num_epochs=5,
                      param_group=True):
    train_iter = torch.utils.data.DataLoader(torchvision.datasets.ImageFolder(
        os.path.join(data_dir, 'train'), transform=train_augs),
        batch_size=batch_size, shuffle=True)
    test_iter = torch.utils.data.DataLoader(torchvision.datasets.ImageFolder(
        os.path.join(data_dir, 'test'), transform=test_augs),
        batch_size=batch_size)
    devices = d2l.try_all_gpus()
    loss = nn.CrossEntropyLoss(reduction="none")
    if param_group:
        params_1x = [param for name, param in net.named_parameters()
                     if name not in ["fc.weight", "fc.bias"]]
        trainer = torch.optim.SGD([{'params': params_1x},
                                   {'params': net.fc.parameters(),
                                    'lr': learning_rate * 10}],
                                  lr=learning_rate, weight_decay=0.001)
    else:
        trainer = torch.optim.SGD(net.parameters(), lr=learning_rate,
                                  weight_decay=0.001)
    d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs,
                   devices)


We set the base learning rate to a small value in order to fine-tune the model parameters 
obtained via pretraining. Based on the previous settings, we will train the output layer 
parameters of the target model from scratch using a learning rate ten times greater. 
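
The effect of the two learning rates can be illustrated with a single hand-written gradient step. The sketch below is a plain-Python toy, not the optimizer used in this section: the helper `sgd_step`, the parameter values, and the gradients are made up, and weight decay is left out for brevity.

```python
def sgd_step(param_groups):
    """One plain SGD step where each group has its own learning rate."""
    for group in param_groups:
        lr = group['lr']
        group['params'] = [p - lr * g
                           for p, g in zip(group['params'], group['grads'])]


base_lr = 5e-5
groups = [
    # Pretrained layers: updated with the base learning rate.
    {'params': [1.0], 'grads': [2.0], 'lr': base_lr},
    # New output layer: updated with ten times the base learning rate.
    {'params': [1.0], 'grads': [2.0], 'lr': base_lr * 10},
]
sgd_step(groups)
print(groups[0]['params'], groups[1]['params'])
```

For the same gradient, the output layer moves ten times as far as the pretrained layers, so the pretrained parameters are only nudged while the new layer can learn quickly.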


train_fine_tuning(finetune_net, 5e-5) 


loss 0.242, train acc 0.909, test acc 0.940
1062.4 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]


For comparison, we define an identical model, but initialize all of its model parameters to 
random values. Since the entire model needs to be trained from scratch, we can use a larger 
learning rate. 


scratch_net = torchvision.models.resnet18()
scratch_net.fc = nn.Linear(scratch_net.fc.in_features, 2)
train_fine_tuning(scratch_net, 5e-4, param_group=False)


loss 0.352, train acc 0.846, test acc 0.850
1525.4 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]


As we can see, the fine-tuned model tends to perform better after the same number of 
epochs because its initial parameter values are more effective. 


14.2.3 Summary 


• Transfer learning transfers knowledge learned from the source dataset to the target dataset. 
Fine-tuning is a common technique for transfer learning. 


• The target model copies all model designs with their parameters from the source model 
except the output layer, and fine-tunes these parameters based on the target dataset. In 
contrast, the output layer of the target model needs to be trained from scratch. 


• Generally, fine-tuning parameters uses a smaller learning rate, while training the output 
layer from scratch can use a larger learning rate. 


14.2.4 Exercises 


1. Keep increasing the learning rate of finetune_net. How does the accuracy of the 
model change? 


2. Further adjust hyperparameters of finetune_net and scratch_net in the comparative 
experiment. Do they still differ in accuracy? 


3. Set the parameters before the output layer of finetune_net to those of the source model 
and do not update them during training. How does the accuracy of the model change? 
You can use the following code. 


for param in finetune_net.parameters():
    param.requires_grad = False


4. In fact, there is a “hotdog” class in the ImageNet dataset. Its corresponding weight 
parameter in the output layer can be obtained via the following code. How can we 
leverage this weight parameter? 


weight = pretrained_net.fc.weight 
hotdog_w = torch.split(weight.data, 1, dim=0)[934] 
hotdog_w.shape 


torch.Size([1, 512]) 


Discussions. 


14.3 Object Detection and Bounding Boxes 


In earlier sections (e.g., Section 8.1 to Section 8.4), we introduced various models for image 
classification. In image classification tasks, we assume that there is only one major object 
in the image and we only focus on how to recognize its category. However, there are often 
multiple objects in the image of interest. We not only want to know their categories, but 
also their specific positions in the image. In computer vision, we refer to such tasks as 
object detection (or object recognition). 


Object detection has been widely applied in many fields. For example, self-driving needs to 
plan traveling routes by detecting the positions of vehicles, pedestrians, roads, and obstacles 
in the captured video images. Besides, robots may use this technique to detect and localize 
objects of interest throughout their navigation of an environment. Moreover, security systems 
may need to detect abnormal objects, such as intruders or bombs. 


In the next few sections, we will introduce several deep learning methods for object 
detection. We will begin with an introduction to positions (or locations) of objects. 


%matplotlib inline
import torch
from d2l import torch as d2l


We will load the sample image to be used in this section. We can see that there is a dog 
on the left side of the image and a cat on the right. They are the two major objects in this 
image. 


d2l.set_figsize()
img = d2l.plt.imread('../img/catdog.jpg')
d2l.plt.imshow(img);


14.3.1 Bounding Boxes 


In object detection, we usually use a bounding box to describe the spatial location of an
object. The bounding box is rectangular and is determined by the x and y coordinates
of the upper-left corner of the rectangle and the x and y coordinates of the lower-right corner.
Another commonly used bounding box representation is the (x, y)-axis coordinates of the
bounding box center, and the width and height of the box.


Here we define functions to convert between these two representations: box_corner_to_center
converts from the two-corner representation to the center-width-height representation, and
box_center_to_corner does the reverse. The input argument boxes should be a two-dimensional
tensor of shape (n, 4), where n is the number of bounding boxes.


#@save
def box_corner_to_center(boxes):
    """Convert from (upper-left, lower-right) to (center, width, height)."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    boxes = torch.stack((cx, cy, w, h), axis=-1)
    return boxes

#@save
def box_center_to_corner(boxes):
    """Convert from (center, width, height) to (upper-left, lower-right)."""
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1 = cx - 0.5 * w
    y1 = cy - 0.5 * h
    x2 = cx + 0.5 * w
    y2 = cy + 0.5 * h
    boxes = torch.stack((x1, y1, x2, y2), axis=-1)
    return boxes


We will define the bounding boxes of the dog and the cat in the image based on the co- 
ordinate information. The origin of the coordinates in the image is the upper-left corner 
of the image, and to the right and down are the positive directions of the x and y axes, 
respectively. 


# Here `bbox` is the abbreviation for bounding box
dog_bbox, cat_bbox = [60.0, 45.0, 378.0, 516.0], [400.0, 112.0, 655.0, 493.0]


We can verify the correctness of the two bounding box conversion functions by converting 
twice. 


boxes = torch.tensor((dog_bbox, cat_bbox))
box_center_to_corner(box_corner_to_center(boxes)) == boxes


tensor([[True, True, True, True],
        [True, True, True, True]])
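

As an additional sanity check, the conversion can be worked by hand for a single box. The sketch below uses plain Python rather than tensors; the helper names corner_to_center and center_to_corner are illustrative only and are not part of d2l.

```python
def corner_to_center(box):
    """(x1, y1, x2, y2) -> (cx, cy, w, h) for one box given as a sequence."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def center_to_corner(box):
    """(cx, cy, w, h) -> (x1, y1, x2, y2) for one box given as a sequence."""
    cx, cy, w, h = box
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


dog_bbox = [60.0, 45.0, 378.0, 516.0]
center_form = corner_to_center(dog_bbox)
print(center_form)                    # (219.0, 280.5, 318.0, 471.0)
print(center_to_corner(center_form))  # (60.0, 45.0, 378.0, 516.0)
```

The dog's box is thus centered at (219, 280.5) with a width of 318 and a height of 471 pixels, and converting back recovers the original corners.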


Let’s draw the bounding boxes in the image to check if they are accurate. Before drawing, 
we will define a helper function bbox_to_rect. It represents the bounding box in the 
bounding box format of the matplotlib package. 


#@save
def bbox_to_rect(bbox, color):
    """Convert bounding box to matplotlib format."""
    # Convert the bounding box (upper-left x, upper-left y, lower-right x,
    # lower-right y) format to the matplotlib format: ((upper-left x,
    # upper-left y), width, height)
    return d2l.plt.Rectangle(
        xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
        fill=False, edgecolor=color, linewidth=2)


After adding the bounding boxes on the image, we can see that the main outlines of the two
objects are basically inside the two boxes.


fig = d2l.plt.imshow(img)
fig.axes.add_patch(bbox_to_rect(dog_bbox, 'blue'))
fig.axes.add_patch(bbox_to_rect(cat_bbox, 'red'));


14.3.2 Summary


• Object detection not only recognizes all the objects of interest in the image, but also their
positions. The position is generally represented by a rectangular bounding box.


• We can convert between two commonly used bounding box representations.


14.3.3 Exercises 


1. Find another image and try to label a bounding box that contains the object. Compare 
labeling bounding boxes and categories: which usually takes longer? 


2. Why is the innermost dimension of the input argument boxes of box_corner_to_center 
and box_center_to_corner always 4? 


Discussions.


14.4 Anchor Boxes 


Object detection algorithms usually sample a large number of regions in the input image, 
determine whether these regions contain objects of interest, and adjust the boundaries of 
the regions so as to predict the ground-truth bounding boxes of the objects more accurately. 
Different models may adopt different region sampling schemes. Here we introduce one
such method: it generates multiple bounding boxes with varying scales and aspect ratios
centered on each pixel. These bounding boxes are called anchor boxes. We will design an 
object detection model based on anchor boxes in Section 14.7. 


First, let’s modify the printing accuracy just for more concise outputs. 


%matplotlib inline
import torch
from d2l import torch as d2l

torch.set_printoptions(2)  # Simplify printing accuracy


14.4.1 Generating Multiple Anchor Boxes


Suppose that the input image has a height of $h$ and a width of $w$. We generate anchor boxes
with different shapes centered on each pixel of the image. Let the scale be $s \in (0, 1]$ and the
aspect ratio (ratio of width to height) be $r > 0$. Then the width and height of the anchor box
are $ws\sqrt{r}$ and $hs/\sqrt{r}$, respectively. Note that when the center position is given, an
anchor box with known width and height is determined.


To generate multiple anchor boxes with different shapes, let's set a series of scales
$s_1, \ldots, s_n$ and a series of aspect ratios $r_1, \ldots, r_m$. When using all the combinations
of these scales and aspect ratios with each pixel as the center, the input image will have a total
of $whnm$ anchor boxes. Although these anchor boxes may cover all the ground-truth bounding boxes,
the computational complexity can easily become too high. In practice, we consider only those
combinations containing $s_1$ or $r_1$:


$$(s_1, r_1), (s_1, r_2), \ldots, (s_1, r_m), (s_2, r_1), (s_3, r_1), \ldots, (s_n, r_1). \tag{14.4.1}$$


That is to say, the number of anchor boxes centered on the same pixel is $n + m - 1$. For the
entire input image, we will generate a total of $wh(n + m - 1)$ anchor boxes.
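

The counting argument can be checked with a few lines of plain Python. This is only a sketch: the helper anchor_shape_pairs, the example scales and ratios, and the tiny 4 × 6 image are all assumptions made for illustration.

```python
def anchor_shape_pairs(sizes, ratios):
    """Return the (scale, ratio) pairs that contain sizes[0] or ratios[0]."""
    pairs = [(s, ratios[0]) for s in sizes]       # (s_1, r_1), ..., (s_n, r_1)
    pairs += [(sizes[0], r) for r in ratios[1:]]  # (s_1, r_2), ..., (s_1, r_m)
    return pairs


sizes, ratios = [0.75, 0.5, 0.25], [1, 2, 0.5]  # example values
pairs = anchor_shape_pairs(sizes, ratios)
per_pixel = len(pairs)

h, w = 4, 6  # hypothetical tiny image, chosen only for illustration
print(pairs)
print(per_pixel, h * w * per_pixel)  # 5 120
```

With three scales and three ratios, each pixel carries 3 + 3 - 1 = 5 anchor boxes instead of the 9 that all combinations would produce.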


The above method of generating anchor boxes is implemented in the following multibox_prior
function. We specify the input image, a list of scales, and a list of aspect ratios;
this function then returns all the anchor boxes.


#@save
def multibox_prior(data, sizes, ratios):
    """Generate anchor boxes with different shapes centered on each pixel."""
    in_height, in_width = data.shape[-2:]
    device, num_sizes, num_ratios = data.device, len(sizes), len(ratios)
    boxes_per_pixel = (num_sizes + num_ratios - 1)
    size_tensor = torch.tensor(sizes, device=device)
    ratio_tensor = torch.tensor(ratios, device=device)
    # Offsets are required to move the anchor to the center of a pixel. Since
    # a pixel has height=1 and width=1, we choose to offset our centers by 0.5
    offset_h, offset_w = 0.5, 0.5
    steps_h = 1.0 / in_height  # Scaled steps in y axis
    steps_w = 1.0 / in_width  # Scaled steps in x axis

    # Generate all center points for the anchor boxes
    center_h = (torch.arange(in_height, device=device) + offset_h) * steps_h
    center_w = (torch.arange(in_width, device=device) + offset_w) * steps_w
    shift_y, shift_x = torch.meshgrid(center_h, center_w, indexing='ij')
    shift_y, shift_x = shift_y.reshape(-1), shift_x.reshape(-1)

    # Generate `boxes_per_pixel` number of heights and widths that are later
    # used to create anchor box corner coordinates (xmin, ymin, xmax, ymax)
    w = torch.cat((size_tensor * torch.sqrt(ratio_tensor[0]),
                   sizes[0] * torch.sqrt(ratio_tensor[1:])))\
                   * in_height / in_width  # Handle rectangular inputs
    h = torch.cat((size_tensor / torch.sqrt(ratio_tensor[0]),
                   sizes[0] / torch.sqrt(ratio_tensor[1:])))
    # Divide by 2 to get half height and half width
    anchor_manipulations = torch.stack((-w, -h, w, h)).T.repeat(
        in_height * in_width, 1) / 2

    # Each center point will have `boxes_per_pixel` number of anchor boxes, so
    # generate a grid of all anchor box centers with `boxes_per_pixel` repeats
    out_grid = torch.stack([shift_x, shift_y, shift_x, shift_y],
                           dim=1).repeat_interleave(boxes_per_pixel, dim=0)
    output = out_grid + anchor_manipulations
    return output.unsqueeze(0)


We can see that the shape of the returned anchor box variable Y is (batch size, number of 
anchor boxes, 4). 


img = d2l.plt.imread('../img/catdog.jpg')
h, w = img.shape[:2]

print(h, w)
X = torch.rand(size=(1, 3, h, w))  # Construct input data
Y = multibox_prior(X, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
Y.shape


torch.Size([1, 2042040, 4])


After changing the shape of the anchor box variable Y to (image height, image width, num- 
ber of anchor boxes centered on the same pixel, 4), we can obtain all the anchor boxes 
centered on a specified pixel position. In the following, we access the first anchor box cen- 
tered on (250, 250). It has four elements: the (x, y)-axis coordinates at the upper-left corner 
and the (x, y)-axis coordinates at the lower-right corner of the anchor box. The coordinate 
values of both axes are divided by the width and height of the image, respectively. 


boxes = Y.reshape(h, w, 5, 4) 
boxes[250, 250, 0, :] 


tensor([0.06, 0.07, 0.63, 0.82])
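

These four values can be reproduced by hand from the formulas used in multibox_prior. The sketch below assumes that the image is 561 pixels high and 728 pixels wide; these dimensions are an assumption here, although they are consistent with the 2042040 anchor boxes reported above, since 561 × 728 × 5 = 2042040.

```python
import math

h, w = 561, 728      # assumed image height and width
row, col = 250, 250  # pixel whose first anchor box we inspect
s, r = 0.75, 1       # first (scale, ratio) pair

# Normalized center of the pixel (offset by 0.5 to hit the pixel center)
cx, cy = (col + 0.5) / w, (row + 0.5) / h
# Normalized width and height, as computed in `multibox_prior`
box_w = s * math.sqrt(r) * h / w
box_h = s / math.sqrt(r)

box = [cx - box_w / 2, cy - box_h / 2, cx + box_w / 2, cy + box_h / 2]
print([round(v, 2) for v in box])  # [0.06, 0.07, 0.63, 0.82]
```

The factor h / w in the normalized width is what keeps the anchor box square in pixel units for r = 1, even though the image itself is rectangular.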


In order to show all the anchor boxes centered on one pixel in the image, we define the 
following show_bboxes function to draw multiple bounding boxes on the image. 


#@save
def show_bboxes(axes, bboxes, labels=None, colors=None):
    """Show bounding boxes."""

    def make_list(obj, default_values=None):
        if obj is None:
            obj = default_values
        elif not isinstance(obj, (list, tuple)):
            obj = [obj]
        return obj

    labels = make_list(labels)
    colors = make_list(colors, ['b', 'g', 'r', 'm', 'c'])
    for i, bbox in enumerate(bboxes):
        color = colors[i % len(colors)]
        rect = d2l.bbox_to_rect(bbox.detach().numpy(), color)
        axes.add_patch(rect)
        if labels and len(labels) > i:
            text_color = 'k' if color == 'w' else 'w'
            axes.text(rect.xy[0], rect.xy[1], labels[i],
                      va='center', ha='center', fontsize=9, color=text_color,
                      bbox=dict(facecolor=color, lw=0))


As we just saw, the coordinate values of the x and y axes in the variable boxes have been 
divided by the width and height of the image, respectively. When drawing anchor boxes, 
we need to restore their original coordinate values; thus, we define variable bbox_scale 
below. Now, we can draw all the anchor boxes centered on (250, 250) in the image. As you
can see, the blue anchor box with a scale of 0.75 and an aspect ratio of 1 surrounds
the dog in the image well.


d2l.set_figsize()
bbox_scale = torch.tensor((w, h, w, h))
fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, boxes[250, 250, :, :] * bbox_scale,
            ['s=0.75, r=1', 's=0.5, r=1', 's=0.25, r=1', 's=0.75, r=2',
             's=0.75, r=0.5'])


14.4.2 Intersection over Union (IoU) 


We just mentioned that an anchor box “well” surrounds the dog in the image. If the ground- 
truth bounding box of the object is known, how can “well” here be quantified? Intuitively, 
we can measure the similarity between the anchor box and the ground-truth bounding box. 
We know that the Jaccard index can measure the similarity between two sets. Given sets
$\mathcal{A}$ and $\mathcal{B}$, their Jaccard index is the size of their intersection divided by
the size of their union:


$$J(\mathcal{A}, \mathcal{B}) = \frac{\left|\mathcal{A} \cap \mathcal{B}\right|}{\left|\mathcal{A} \cup \mathcal{B}\right|}. \tag{14.4.2}$$


In fact, we can consider the pixel area of any bounding box as a set of pixels. In this way,
we can measure the similarity of the two bounding boxes by the Jaccard index of their pixel
sets. For two bounding boxes, we usually refer to their Jaccard index as intersection over
union (IoU), which is the ratio of their intersection area to their union area, as shown in
Fig. 14.4.1. The range of an IoU is between 0 and 1: 0 means that two bounding boxes do
not overlap at all, while 1 indicates that the two bounding boxes are equal.


Fig. 14.4.1 IoU is the ratio of the intersection area to the union area of two bounding boxes.
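

For a single pair of boxes, the definition can be worked through directly. The following is a minimal sketch in plain Python; the function name iou and the three sample box pairs are illustrative assumptions.

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so that disjoint boxes have an empty intersection
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7, about 0.143
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 (identical boxes)
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0 (no overlap)
```

In the first pair the boxes share a 1 × 1 square, while their union covers 4 + 4 - 1 = 7 units of area, which gives an IoU of 1/7.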


For the remainder of this section, we will use IoU to measure the similarity between anchor 
boxes and ground-truth bounding boxes, and between different anchor boxes. Given two 
lists of anchor or bounding boxes, the following box_iou computes their pairwise IoU 
across these two lists. 


#@save
def box_iou(boxes1, boxes2):
    """Compute pairwise IoU across two lists of anchor or bounding boxes."""
    box_area = lambda boxes: ((boxes[:, 2] - boxes[:, 0]) *
                              (boxes[:, 3] - boxes[:, 1]))
    # Shape of `boxes1`, `boxes2`, `areas1`, `areas2`: (no. of boxes1, 4),
    # (no. of boxes2, 4), (no. of boxes1,), (no. of boxes2,)
    areas1 = box_area(boxes1)
    areas2 = box_area(boxes2)
    # Shape of `inter_upperlefts`, `inter_lowerrights`, `inters`: (no. of
    # boxes1, no. of boxes2, 2)
    inter_upperlefts = torch.max(boxes1[:, None, :2], boxes2[:, :2])
    inter_lowerrights = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])
    inters = (inter_lowerrights - inter_upperlefts).clamp(min=0)
    # Shape of `inter_areas` and `union_areas`: (no. of boxes1, no. of boxes2)
    inter_areas = inters[:, :, 0] * inters[:, :, 1]
    union_areas = areas1[:, None] + areas2 - inter_areas
    return inter_areas / union_areas


14.4.3 Labeling Anchor Boxes in Training Data 


In a training dataset, we consider each anchor box as a training example. In order to train
an object detection model, we need class and offset labels for each anchor box, where the
former is the class of the object relevant to the anchor box and the latter is the offset of the
ground-truth bounding box relative to the anchor box. During prediction, for each image
we generate multiple anchor boxes, predict classes and offsets for all the anchor boxes,
adjust their positions according to the predicted offsets to obtain the predicted bounding
boxes, and finally only output those predicted bounding boxes that satisfy certain criteria.


As we know, an object detection training set comes with labels for locations of ground-truth
bounding boxes and classes of the objects they surround. To label any generated anchor box,
we refer to the labeled location and class of its assigned ground-truth bounding box that is
closest to the anchor box. In the following, we describe an algorithm for assigning closest
ground-truth bounding boxes to anchor boxes.


Assigning Ground-Truth Bounding Boxes to Anchor Boxes 


Given an image, suppose that the anchor boxes are $A_1, A_2, \ldots, A_{n_a}$ and the ground-truth
bounding boxes are $B_1, B_2, \ldots, B_{n_b}$, where $n_a \geq n_b$. Let's define a matrix
$\mathbf{X} \in \mathbb{R}^{n_a \times n_b}$, whose element $x_{ij}$ in the $i^\textrm{th}$ row and
$j^\textrm{th}$ column is the IoU of the anchor box $A_i$ and the ground-truth bounding box $B_j$.
The algorithm consists of the following steps:


1. Find the largest element in matrix $\mathbf{X}$ and denote its row and column indices as $i_1$
and $j_1$, respectively. Then the ground-truth bounding box $B_{j_1}$ is assigned to the anchor box
$A_{i_1}$. This is quite intuitive because $A_{i_1}$ and $B_{j_1}$ are the closest among all the
pairs of anchor boxes and ground-truth bounding boxes. After the first assignment, discard all
the elements in the ${i_1}^\textrm{th}$ row and the ${j_1}^\textrm{th}$ column in matrix $\mathbf{X}$.


2. Find the largest of the remaining elements in matrix $\mathbf{X}$ and denote its row and column
indices as $i_2$ and $j_2$, respectively. We assign ground-truth bounding box $B_{j_2}$ to anchor
box $A_{i_2}$ and discard all the elements in the ${i_2}^\textrm{th}$ row and the
${j_2}^\textrm{th}$ column in matrix $\mathbf{X}$.


3. At this point, elements in two rows and two columns in matrix $\mathbf{X}$ have been discarded.
We proceed until all elements in $n_b$ columns in matrix $\mathbf{X}$ are discarded. At this time,
we have assigned a ground-truth bounding box to each of $n_b$ anchor boxes.


4. Only traverse through the remaining $n_a - n_b$ anchor boxes. For example, given any
anchor box $A_i$, find the ground-truth bounding box $B_j$ with the largest IoU with $A_i$
throughout the $i^\textrm{th}$ row of matrix $\mathbf{X}$, and assign $B_j$ to $A_i$ only if this
IoU is greater than a predefined threshold.
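

The four steps above can be sketched in plain Python on an IoU matrix stored as a list of rows. This is an illustrative sketch rather than a definitive implementation: the function name assign_ground_truth and the IoU values are hypothetical, the matrix is assumed to have at least as many rows as columns, and the threshold test uses `>=`, which is a choice made here.

```python
def assign_ground_truth(iou_matrix, threshold=0.5):
    """Return, for each anchor (row), the assigned ground-truth index or -1."""
    n_a, n_b = len(iou_matrix), len(iou_matrix[0])
    assignment = [-1] * n_a
    free_rows, free_cols = set(range(n_a)), set(range(n_b))
    # Steps 1-3: repeatedly take the largest remaining element, then discard
    # its row and column, until every column has been discarded
    while free_cols:
        i, j = max(((i, j) for i in free_rows for j in free_cols),
                   key=lambda ij: iou_matrix[ij[0]][ij[1]])
        assignment[i] = j
        free_rows.remove(i)
        free_cols.remove(j)
    # Step 4: each remaining anchor takes its best match only if the IoU
    # reaches the threshold
    for i in free_rows:
        j = max(range(n_b), key=lambda j: iou_matrix[i][j])
        if iou_matrix[i][j] >= threshold:
            assignment[i] = j
    return assignment


# Hypothetical IoU values for five anchor boxes and two ground-truth boxes
X = [[0.10, 0.00],
     [0.60, 0.20],
     [0.00, 0.55],
     [0.00, 0.30],
     [0.05, 0.70]]
print(assign_ground_truth(X))  # [-1, 0, 1, -1, 1]
```

Here the largest element, 0.70, assigns ground-truth box 1 to anchor 4; the largest remaining element, 0.60, assigns ground-truth box 0 to anchor 1; and of the leftover anchors only anchor 2 passes the threshold.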


Let's illustrate the above algorithm using a concrete example. As shown in Fig. 14.4.2 (left),
assuming that the maximum value in matrix $\mathbf{X}$ is $x_{23}$, we assign the ground-truth
bounding box $B_3$ to the anchor box $A_2$. Then, we discard all the elements in row 2 and column 3
of the matrix, find the largest $x_{71}$ in the remaining elements (shaded area), and assign the
ground-truth bounding box $B_1$ to the anchor box $A_7$. Next, as shown in Fig. 14.4.2 (middle),
discard all the elements in row 7 and column 1 of the matrix, find the largest $x_{54}$ in the
remaining elements (shaded area), and assign the ground-truth bounding box $B_4$ to the anchor box
$A_5$. Finally, as shown in Fig. 14.4.2 (right), discard all the elements in row 5 and column 4
of the matrix, find the largest $x_{92}$ in the remaining elements (shaded area), and assign the
ground-truth bounding box $B_2$ to the anchor box $A_9$. After that, we only need to traverse
through the remaining anchor boxes $A_1, A_3, A_4, A_6, A_8$ and determine whether to assign
them ground-truth bounding boxes according to the threshold.


Fig. 14.4.2 Assigning ground-truth bounding boxes to anchor boxes.


This algorithm is implemented in the following assign_anchor_to_bbox function. 


#@save
def assign_anchor_to_bbox(ground_truth, anchors, device, iou_threshold=0.5):
    """Assign closest ground-truth bounding boxes to anchor boxes."""
    num_anchors, num_gt_boxes = anchors.shape[0], ground_truth.shape[0]
    # Element x_ij in the i-th row and j-th column is the IoU of the anchor
    # box i and the ground-truth bounding box j
    jaccard = box_iou(anchors, ground_truth)
    # Initialize the tensor to hold the assigned ground-truth bounding box for
    # each anchor
    anchors_bbox_map = torch.full((num_anchors,), -1, dtype=torch.long,
                                  device=device)
    # Assign ground-truth bounding boxes according to the threshold
    max_ious, indices = torch.max(jaccard, dim=1)
    anc_i = torch.nonzero(max_ious >= iou_threshold).reshape(-1)
    box_j = indices[max_ious >= iou_threshold]
    anchors_bbox_map[anc_i] = box_j
    col_discard = torch.full((num_anchors,), -1)
    row_discard = torch.full((num_gt_boxes,), -1)
    for _ in range(num_gt_boxes):
        max_idx = torch.argmax(jaccard)  # Find the largest IoU
        box_idx = (max_idx % num_gt_boxes).long()
        anc_idx = (max_idx / num_gt_boxes).long()
        anchors_bbox_map[anc_idx] = box_idx
        jaccard[:, box_idx] = col_discard
        jaccard[anc_idx, :] = row_discard
    return anchors_bbox_map


Labeling Classes and Offsets 


Now we can label the class and offset for each anchor box. Suppose that an anchor box
$A$ is assigned a ground-truth bounding box $B$. On the one hand, the class of the anchor
box $A$ will be labeled as that of $B$. On the other hand, the offset of the anchor box $A$ will
be labeled according to the relative position between the central coordinates of $B$ and $A$
together with the relative size between these two boxes. Given varying positions and sizes
of different boxes in the dataset, we can apply transformations to those relative positions
and sizes that may lead to more uniformly distributed offsets that are easier to fit. Here we
describe a common transformation. Given the central coordinates of $A$ and $B$ as $(x_a, y_a)$
and $(x_b, y_b)$, their widths as $w_a$ and $w_b$, and their heights as $h_a$ and $h_b$,
respectively, we may label the offset of $A$ as


$$\left( \frac{ \frac{x_b - x_a}{w_a} - \mu_x }{\sigma_x},
\frac{ \frac{y_b - y_a}{h_a} - \mu_y }{\sigma_y},
\frac{ \log \frac{w_b}{w_a} - \mu_w }{\sigma_w},
\frac{ \log \frac{h_b}{h_a} - \mu_h }{\sigma_h}\right), \tag{14.4.3}$$


where default values of the constants are $\mu_x = \mu_y = \mu_w = \mu_h = 0$,
$\sigma_x = \sigma_y = 0.1$, and $\sigma_w = \sigma_h = 0.2$. This transformation is implemented
below in the offset_boxes function.


#@save
def offset_boxes(anchors, assigned_bb, eps=1e-6):
    """Transform for anchor box offsets."""
    c_anc = d2l.box_corner_to_center(anchors)
    c_assigned_bb = d2l.box_corner_to_center(assigned_bb)
    offset_xy = 10 * (c_assigned_bb[:, :2] - c_anc[:, :2]) / c_anc[:, 2:]
    offset_wh = 5 * torch.log(eps + c_assigned_bb[:, 2:] / c_anc[:, 2:])
    offset = torch.cat([offset_xy, offset_wh], axis=1)
    return offset


If an anchor box is not assigned a ground-truth bounding box, we just label the class of 
the anchor box as “background”. Anchor boxes whose classes are background are often 
referred to as negative anchor boxes, and the rest are called positive anchor boxes. We 
implement the following multibox_target function to label classes and offsets for anchor 
boxes (the anchors argument) using ground-truth bounding boxes (the labels argument). 
This function sets the background class to zero and increments the integer index of a new 
class by one. 


#@save
def multibox_target(anchors, labels):
    """Label anchor boxes using ground-truth bounding boxes."""
    batch_size, anchors = labels.shape[0], anchors.squeeze(0)
    batch_offset, batch_mask, batch_class_labels = [], [], []
    device, num_anchors = anchors.device, anchors.shape[0]
    for i in range(batch_size):
        label = labels[i, :, :]
        anchors_bbox_map = assign_anchor_to_bbox(
            label[:, 1:], anchors, device)
        bbox_mask = ((anchors_bbox_map >= 0).float().unsqueeze(-1)).repeat(
            1, 4)
        # Initialize class labels and assigned bounding box coordinates with
        # zeros
        class_labels = torch.zeros(num_anchors, dtype=torch.long,
                                   device=device)
        assigned_bb = torch.zeros((num_anchors, 4), dtype=torch.float32,
                                  device=device)
        # Label classes of anchor boxes using their assigned ground-truth
        # bounding boxes. If an anchor box is not assigned any, we label its
        # class as background (the value remains zero)
        indices_true = torch.nonzero(anchors_bbox_map >= 0)
        bb_idx = anchors_bbox_map[indices_true]
        class_labels[indices_true] = label[bb_idx, 0].long() + 1
        assigned_bb[indices_true] = label[bb_idx, 1:]
        # Offset transformation
        offset = offset_boxes(anchors, assigned_bb) * bbox_mask
        batch_offset.append(offset.reshape(-1))
        batch_mask.append(bbox_mask.reshape(-1))
        batch_class_labels.append(class_labels)
    bbox_offset = torch.stack(batch_offset)
    bbox_mask = torch.stack(batch_mask)
    class_labels = torch.stack(batch_class_labels)
    return (bbox_offset, bbox_mask, class_labels)


An Example 


Let's illustrate anchor box labeling via a concrete example. We define ground-truth bounding
boxes for the dog and cat in the loaded image, where the first element is the class (0
for dog and 1 for cat) and the remaining four elements are the (x, y)-axis coordinates at
the upper-left corner and the lower-right corner (range is between 0 and 1). We also construct
five anchor boxes to be labeled using the coordinates of the upper-left corner and the
lower-right corner: $A_0, \ldots, A_4$ (the index starts from 0). Then we plot these ground-truth
bounding boxes and anchor boxes in the image.


ground_truth = torch.tensor([[0, 0.1, 0.08, 0.52, 0.92],
                             [1, 0.55, 0.2, 0.9, 0.88]])
anchors = torch.tensor([[0, 0.1, 0.2, 0.3], [0.15, 0.2, 0.4, 0.4],
                        [0.63, 0.05, 0.88, 0.98], [0.66, 0.45, 0.8, 0.8],
                        [0.57, 0.3, 0.92, 0.9]])

fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, ground_truth[:, 1:] * bbox_scale, ['dog', 'cat'], 'k')
show_bboxes(fig.axes, anchors * bbox_scale, ['0', '1', '2', '3', '4']);


Using the multibox_target function defined above, we can label classes and offsets of
these anchor boxes based on the ground-truth bounding boxes for the dog and cat. In this
example, indices of the background, dog, and cat classes are 0, 1, and 2, respectively.
Below we add a dimension for examples of anchor boxes and ground-truth bounding
boxes.


labels = multibox_target(anchors.unsqueeze(dim=0),
                         ground_truth.unsqueeze(dim=0))


There are three items in the returned result, all of which are in the tensor format. The third 
item contains the labeled classes of the input anchor boxes. 


Let's analyze the returned class labels below based on anchor box and ground-truth bounding
box positions in the image. First, among all the pairs of anchor boxes and ground-truth
bounding boxes, the IoU of the anchor box $A_4$ and the ground-truth bounding box of the
cat is the largest. Thus, the class of $A_4$ is labeled as the cat. Taking out pairs containing
$A_4$ or the ground-truth bounding box of the cat, among the rest the pair of the anchor box
$A_1$ and the ground-truth bounding box of the dog has the largest IoU. So the class of $A_1$ is
labeled as the dog. Next, we need to traverse through the remaining three unlabeled anchor
boxes: $A_0$, $A_2$, and $A_3$. For $A_0$, the class of the ground-truth bounding box with the
largest IoU is the dog, but the IoU is below the predefined threshold (0.5), so the class is labeled
as background; for $A_2$, the class of the ground-truth bounding box with the largest IoU is
the cat and the IoU exceeds the threshold, so the class is labeled as the cat; for $A_3$, the class
of the ground-truth bounding box with the largest IoU is the cat, but the value is below the
threshold, so the class is labeled as background.


labels[2] 


tensor([[0, 1, 2, 0, 2]])
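

The reasoning above can be checked numerically. The following sketch recomputes the pairwise IoU values of this example in plain Python; the helper iou and the variable names are illustrative and are not part of d2l.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    inter = (max(min(a[2], b[2]) - max(a[0], b[0]), 0) *
             max(min(a[3], b[3]) - max(a[1], b[1]), 0))
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter)


gt_boxes = {'dog': [0.1, 0.08, 0.52, 0.92], 'cat': [0.55, 0.2, 0.9, 0.88]}
anchor_boxes = [[0, 0.1, 0.2, 0.3], [0.15, 0.2, 0.4, 0.4],
                [0.63, 0.05, 0.88, 0.98], [0.66, 0.45, 0.8, 0.8],
                [0.57, 0.3, 0.92, 0.9]]

for k, anchor in enumerate(anchor_boxes):
    row = {name: round(iou(anchor, gt), 2) for name, gt in gt_boxes.items()}
    print(f'A{k}:', row)
# A0: {'dog': 0.05, 'cat': 0.0}
# A1: {'dog': 0.14, 'cat': 0.0}
# A2: {'dog': 0.0, 'cat': 0.57}
# A3: {'dog': 0.0, 'cat': 0.21}
# A4: {'dog': 0.0, 'cat': 0.75}
```

The largest value overall (0.75) pairs $A_4$ with the cat, the largest value left in the dog column (0.14) pairs $A_1$ with the dog even though it is below the threshold, and among the remaining anchor boxes only $A_2$ (0.57) reaches 0.5.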


The second returned item is a mask variable of the shape (batch size, four times the number 
of anchor boxes). Every four elements in the mask variable correspond to the four offset 
values of each anchor box. Since we do not care about background detection, offsets of this 
negative class should not affect the objective function. Through elementwise multiplica- 
tions, zeros in the mask variable will filter out negative class offsets before calculating the 
objective function. 


labels[1] 


tensor([[0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 1., 1.,
         1., 1.]])


The first returned item contains the four offset values labeled for each anchor box. Note 
that the offsets of negative-class anchor boxes are labeled as zeros. 


labels[0]


tensor([[-0.00e+00, -0.00e+00, -0.00e+00, -0.00e+00,  1.40e+00,  1.00e+01,
          2.59e+00,  7.18e+00, -1.20e+00,  2.69e-01,  1.68e+00, -1.57e+00,
         -0.00e+00, -0.00e+00, -0.00e+00, -0.00e+00, -5.71e-01, -1.00e+00,
          4.17e-06,  6.26e-01]])
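

The last four values, which belong to anchor box $A_4$ and the cat, can be reproduced by hand from (14.4.3). The sketch below is a plain-Python illustration; the helper names to_center and offset are assumptions made here, and the computation runs in double precision, so the third value comes out near 5e-06 rather than the 4.17e-06 shown above, which is presumably an effect of float32 rounding.

```python
import math


def to_center(box):
    """(x1, y1, x2, y2) -> (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1


def offset(anchor, gt, eps=1e-6):
    """Offsets of (14.4.3) with mu = 0, sigma_x = sigma_y = 0.1,
    and sigma_w = sigma_h = 0.2."""
    xa, ya, wa, ha = to_center(anchor)
    xb, yb, wb, hb = to_center(gt)
    return [10 * (xb - xa) / wa, 10 * (yb - ya) / ha,
            5 * math.log(eps + wb / wa), 5 * math.log(eps + hb / ha)]


a4 = [0.57, 0.3, 0.92, 0.9]   # anchor box A4
cat = [0.55, 0.2, 0.9, 0.88]  # ground-truth bounding box of the cat
print([round(v, 2) for v in offset(a4, cat)])  # [-0.57, -1.0, 0.0, 0.63]
```

Dividing by 0.1 and 0.2 appears in the code as multiplying by 10 and 5, respectively.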


14.4.4 Predicting Bounding Boxes with Non-Maximum Suppression 


During prediction, we generate multiple anchor boxes for the image and predict classes and
offsets for each of them. A predicted bounding box is thus obtained according to an anchor
box with its predicted offset. Below we implement the offset_inverse function that takes
in anchors and offset predictions as inputs and applies inverse offset transformations to
return the predicted bounding box coordinates.


#@save
def offset_inverse(anchors, offset_preds):
    """Predict bounding boxes based on anchor boxes with predicted offsets."""
    anc = d2l.box_corner_to_center(anchors)
    pred_bbox_xy = (offset_preds[:, :2] * anc[:, 2:] / 10) + anc[:, :2]
    pred_bbox_wh = torch.exp(offset_preds[:, 2:] / 5) * anc[:, 2:]
    pred_bbox = torch.cat((pred_bbox_xy, pred_bbox_wh), axis=1)
    predicted_bbox = d2l.box_center_to_corner(pred_bbox)
    return predicted_bbox
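

To see that this really inverts the labeling transformation, one can apply it by hand to a single anchor box. The sketch below is plain Python under stated assumptions: the helper names are illustrative, and the offsets are rounded values taken from the labeled example in the previous subsection for the anchor box paired with the cat.

```python
import math


def to_center(box):
    """(x1, y1, x2, y2) -> (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1


def apply_offset(anchor, off):
    """Invert the offset labels: anchor (corner format) + offsets -> box."""
    xa, ya, wa, ha = to_center(anchor)
    cx, cy = off[0] * wa / 10 + xa, off[1] * ha / 10 + ya
    w, h = math.exp(off[2] / 5) * wa, math.exp(off[3] / 5) * ha
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]


anchor = [0.57, 0.3, 0.92, 0.9]
offsets = [-0.571, -1.0, 0.0, 0.626]  # rounded offset labels for this anchor
print([round(v, 2) for v in apply_offset(anchor, offsets)])
# [0.55, 0.2, 0.9, 0.88]
```

Up to rounding, the result is the ground-truth bounding box of the cat, so a model that predicted these offsets exactly would recover the labeled box.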


When there are many anchor boxes, many similar predicted bounding boxes (with significant
overlap) may be output for the same object. To simplify the output,
we can merge similar predicted bounding boxes that belong to the same object by using
non-maximum suppression (NMS).


Here is how non-maximum suppression works. For a predicted bounding box B, the object 
detection model calculates the predicted likelihood for each class. Denoting by p the largest 
predicted likelihood, the class corresponding to this probability is the predicted class for B. 
Specifically, we refer to p as the confidence (score) of the predicted bounding box B. On the 
same image, all the predicted non-background bounding boxes are sorted by confidence in 
descending order to generate a list L. Then we manipulate the sorted list L in the following 
steps: 


1. Select the predicted bounding box $B_1$ with the highest confidence from $L$ as a basis and
remove all non-basis predicted bounding boxes whose IoU with $B_1$ exceeds a predefined
threshold $\epsilon$ from $L$. At this point, $L$ keeps the predicted bounding box with the highest
confidence but drops others that are too similar to it. In a nutshell, those with
non-maximum confidence scores are suppressed.


2. Select the predicted bounding box $B_2$ with the second highest confidence from $L$ as
another basis and remove all non-basis predicted bounding boxes whose IoU with $B_2$
exceeds $\epsilon$ from $L$.


3. Repeat the above process until all the predicted bounding boxes in $L$ have been used as
a basis. At this time, the IoU of any pair of predicted bounding boxes in $L$ is below the
threshold $\epsilon$; thus, no two boxes are too similar to each other.


4. Output all the predicted bounding boxes in the list $L$.
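

The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a definitive implementation: the names iou and nms_indices are assumptions, and the sample boxes and confidences are chosen only for illustration.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    inter = (max(min(a[2], b[2]) - max(a[0], b[0]), 0) *
             max(min(a[3], b[3]) - max(a[1], b[1]), 0))
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter)


def nms_indices(boxes, scores, threshold):
    """Return indices of the boxes kept by non-maximum suppression."""
    # The list L: indices sorted by confidence in descending order
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    while order:
        basis = order.pop(0)  # highest remaining confidence
        keep.append(basis)
        # Drop every remaining box that overlaps the basis too much
        order = [k for k in order if iou(boxes[basis], boxes[k]) <= threshold]
    return keep


# Sample boxes and confidences, chosen for illustration
boxes = [[0.1, 0.08, 0.52, 0.92], [0.08, 0.2, 0.56, 0.95],
         [0.15, 0.3, 0.62, 0.91], [0.55, 0.2, 0.9, 0.88]]
scores = [0.9, 0.8, 0.7, 0.9]
print(nms_indices(boxes, scores, threshold=0.5))  # [0, 3]
```

Boxes 1 and 2 overlap box 0 with IoU values of about 0.74 and 0.55, so both are suppressed, while box 3 does not overlap box 0 at all and is kept.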


The following nms function sorts confidence scores in descending order and returns their 
indices. 


#@save
def nms(boxes, scores, iou_threshold):
    """Sort confidence scores of predicted bounding boxes."""
    B = torch.argsort(scores, dim=-1, descending=True)
    keep = []  # Indices of predicted bounding boxes that will be kept
    while B.numel() > 0:
        i = B[0]
        keep.append(i)
        if B.numel() == 1: break
        iou = box_iou(boxes[i, :].reshape(-1, 4),
                      boxes[B[1:], :].reshape(-1, 4)).reshape(-1)
        inds = torch.nonzero(iou <= iou_threshold).reshape(-1)
        B = B[inds + 1]
    return torch.tensor(keep, device=boxes.device)


We define the following multibox_detection to apply non-maximum suppression to pre- 
dicting bounding boxes. Do not worry if you find the implementation a bit complicated: 
we will show how it works with a concrete example right after the implementation. 


#@save
def multibox_detection(cls_probs, offset_preds, anchors, nms_threshold=0.5,
                       pos_threshold=0.009999999):
    """Predict bounding boxes using non-maximum suppression."""
    device, batch_size = cls_probs.device, cls_probs.shape[0]
    anchors = anchors.squeeze(0)
    num_classes, num_anchors = cls_probs.shape[1], cls_probs.shape[2]
    out = []
    for i in range(batch_size):
        cls_prob, offset_pred = cls_probs[i], offset_preds[i].reshape(-1, 4)
        conf, class_id = torch.max(cls_prob[1:], 0)
        predicted_bb = offset_inverse(anchors, offset_pred)
        keep = nms(predicted_bb, conf, nms_threshold)
        # Find all non-`keep` indices and set the class to background
        all_idx = torch.arange(num_anchors, dtype=torch.long, device=device)
        combined = torch.cat((keep, all_idx))
        uniques, counts = combined.unique(return_counts=True)
        non_keep = uniques[counts == 1]
        all_id_sorted = torch.cat((keep, non_keep))
        class_id[non_keep] = -1
        class_id = class_id[all_id_sorted]
        conf, predicted_bb = conf[all_id_sorted], predicted_bb[all_id_sorted]
        # Here `pos_threshold` is a threshold for positive (non-background)
        # predictions
        below_min_idx = (conf < pos_threshold)
        class_id[below_min_idx] = -1
        conf[below_min_idx] = 1 - conf[below_min_idx]
        pred_info = torch.cat((class_id.unsqueeze(1),
                               conf.unsqueeze(1),
                               predicted_bb), dim=1)
        out.append(pred_info)
    return torch.stack(out)


Now let’s apply the above implementations to a concrete example with four anchor boxes. 
For simplicity, we assume that the predicted offsets are all zeros. This means that the 
predicted bounding boxes are anchor boxes. For each class among the background, dog, 
and cat, we also define its predicted likelihood. 


anchors = torch.tensor([[0.1, 0.08, 0.52, 0.92], [0.08, 0.2, 0.56, 0.95],
                        [0.15, 0.3, 0.62, 0.91], [0.55, 0.2, 0.9, 0.88]])
offset_preds = torch.tensor([0] * anchors.numel())
cls_probs = torch.tensor([[0] * 4,  # Predicted background likelihood
                          [0.9, 0.8, 0.7, 0.1],  # Predicted dog likelihood
                          [0.1, 0.2, 0.3, 0.9]])  # Predicted cat likelihood


We can plot these predicted bounding boxes with their confidence on the image. 


fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, anchors * bbox_scale,
            ['dog=0.9', 'dog=0.8', 'dog=0.7', 'cat=0.9'])


Now we can invoke the multibox_detection function to perform non-maximum suppres- 
sion, where the threshold is set to 0.5. Note that we add a dimension for examples in the 
tensor input. 


We can see that the shape of the returned result is (batch size, number of anchor boxes,
6). The six elements in the innermost dimension give the output information for the same
predicted bounding box. The first element is the predicted class index, which starts from
0 (0 is dog and 1 is cat). The value -1 indicates background or removal in non-maximum
suppression. The second element is the confidence of the predicted bounding box. The
remaining four elements are the (x, y)-axis coordinates of the upper-left corner and the
lower-right corner of the predicted bounding box, respectively (the range is between 0
and 1).

output = multibox_detection(cls_probs.unsqueeze(dim=0),
                            offset_preds.unsqueeze(dim=0),
                            anchors.unsqueeze(dim=0),
                            nms_threshold=0.5)
output


tensor([[[ 0.00,  0.90,  0.10,  0.08,  0.52,  0.92],
         [ 1.00,  0.90,  0.55,  0.20,  0.90,  0.88],
         [-1.00,  0.80,  0.08,  0.20,  0.56,  0.95],
         [-1.00,  0.70,  0.15,  0.30,  0.62,  0.91]]])


After removing those predicted bounding boxes of class -1, we can output the final predicted 
bounding box kept by non-maximum suppression. 


fig = d2l.plt.imshow(img)
for i in output[0].detach().numpy():
    if i[0] == -1:
        continue
    label = ('dog=', 'cat=')[int(i[0])] + str(i[1])
    show_bboxes(fig.axes, [torch.tensor(i[2:]) * bbox_scale], label)




In practice, we can remove predicted bounding boxes with lower confidence even before 
performing non-maximum suppression, thereby reducing computation in this algorithm. 
We may also post-process the output of non-maximum suppression, for example, by only 
keeping results with higher confidence in the final output. 
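
The pre-filtering idea can be illustrated with a minimal pure-Python sketch. It is not the book's `nms` function: the helper names `iou` and `filtered_nms` and the `min_conf` parameter are assumptions made for illustration, and the boxes and scores are the four anchor boxes from the example above.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def filtered_nms(boxes, scores, iou_threshold=0.5, min_conf=0.0):
    """Greedy NMS that first discards boxes scoring below `min_conf`."""
    # Pre-filtering: low-confidence boxes never enter the pairwise IoU loop
    order = sorted((i for i, s in enumerate(scores) if s >= min_conf),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [[0.1, 0.08, 0.52, 0.92], [0.08, 0.2, 0.56, 0.95],
         [0.15, 0.3, 0.62, 0.91], [0.55, 0.2, 0.9, 0.88]]
scores = [0.9, 0.8, 0.7, 0.9]
keep = filtered_nms(boxes, scores, iou_threshold=0.5, min_conf=0.75)
print(sorted(keep))  # [0, 3]
```

With `min_conf=0.75` the box with confidence 0.7 is dropped before any IoU is computed, yet the kept boxes are the same two as without filtering, since that box would have been suppressed anyway.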


14.4.5 Summary 


• We generate anchor boxes with different shapes centered on each pixel of the image.

• Intersection over union (IoU), also known as the Jaccard index, measures the similarity of
  two bounding boxes. It is the ratio of their intersection area to their union area.

• In a training set, we need two types of labels for each anchor box. One is the class of
  the object relevant to the anchor box and the other is the offset of the ground-truth
  bounding box relative to the anchor box.

• During prediction, we can use non-maximum suppression (NMS) to remove similar
  predicted bounding boxes, thereby simplifying the output.


14.4.6 Exercises


1. Change values of sizes and ratios in the multibox_prior function. What are the 
changes to the generated anchor boxes? 


2. Construct and visualize two bounding boxes with an IoU of 0.5. How do they overlap 
with each other? 


3. Modify the variable anchors in Section 14.4.3 and Section 14.4.4. How do the results 
change? 


4. Non-maximum suppression is a greedy algorithm that suppresses predicted bounding 
boxes by removing them. Is it possible that some of these removed ones are actually 
useful? How can this algorithm be modified to suppress softly? You may refer to Soft- 
NMS (Bodla et al., 2017). 


5. Rather than being hand-crafted, can non-maximum suppression be learned? 


Discussions.


14.5 Multiscale Object Detection 


In Section 14.4, we generated multiple anchor boxes centered on each pixel of an input 
image. Essentially these anchor boxes represent samples of different regions of the image. 
However, we may end up with too many anchor boxes to compute if they are generated for 
every pixel. Think of a 561 x 728 input image. If five anchor boxes with varying shapes 
are generated for each pixel as their center, over two million anchor boxes (561 x 728 x 5) 
need to be labeled and predicted on the image. 


14.5.1 Multiscale Anchor Boxes 


You may realize that it is not difficult to reduce anchor boxes on an image. For instance, we 
can just uniformly sample a small portion of pixels from the input image to generate anchor 
boxes centered on them. In addition, at different scales we can generate different numbers 
of anchor boxes of different sizes. Intuitively, smaller objects are more likely to appear on 
an image than larger ones. As an example, 1 × 1, 1 × 2, and 2 × 2 objects can appear on a
2 × 2 image in 4, 2, and 1 possible ways, respectively. Therefore, when using smaller anchor
boxes to detect smaller objects, we can sample more regions, while for larger objects we 
can sample fewer regions. 
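
The counting argument above can be checked with a short sketch. The function name `count_placements` is hypothetical, and the sketch assumes objects are axis-aligned, keep a fixed orientation, and sit on integer pixel positions.

```python
def count_placements(obj_h, obj_w, img_h, img_w):
    """Number of distinct positions of an obj_h x obj_w object on an image grid."""
    # The top-left corner can take (img_h - obj_h + 1) rows and
    # (img_w - obj_w + 1) columns; objects larger than the image fit nowhere
    return max(0, img_h - obj_h + 1) * max(0, img_w - obj_w + 1)


counts = [count_placements(h, w, 2, 2) for h, w in [(1, 1), (1, 2), (2, 2)]]
print(counts)  # [4, 2, 1]
```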


To demonstrate how to generate anchor boxes at multiple scales, let’s read an image. Its 
height and width are 561 and 728 pixels, respectively. 


%matplotlib inline
import torch
from d2l import torch as d2l

img = d2l.plt.imread('../img/catdog.jpg')
h, w = img.shape[:2]
h, w


(561, 728) 


Recall that in Section 7.2 we call a two-dimensional array output of a convolutional layer 
a feature map. By defining the feature map shape, we can determine centers of uniformly 
sampled anchor boxes on any image. 


The display_anchors function is defined below. We generate anchor boxes (anchors) on 
the feature map (fmap) with each unit (pixel) as the anchor box center. Since the (x, y)- 
axis coordinate values in the anchor boxes (anchors) have been divided by the width and 
height of the feature map (fmap), these values are between 0 and 1, which indicate the 
relative positions of anchor boxes in the feature map. 


Since centers of the anchor boxes (anchors) are spread over all units on the feature map 
(fmap), these centers must be uniformly distributed on any input image in terms of their 
relative spatial positions. More concretely, given the width and height of the feature map 
fmap_w and fmap_h, respectively, the following function will uniformly sample pixels in 
fmap_h rows and fmap_w columns on any input image. Centered on these uniformly sam- 
pled pixels, anchor boxes of scale s (assuming the length of the list s is 1) and different 
aspect ratios (ratios) will be generated. 


def display_anchors(fmap_w, fmap_h, s):
    d2l.set_figsize()
    # Values on the first two dimensions do not affect the output
    fmap = torch.zeros((1, 10, fmap_h, fmap_w))
    anchors = d2l.multibox_prior(fmap, sizes=s, ratios=[1, 2, 0.5])
    bbox_scale = torch.tensor((w, h, w, h))
    d2l.show_bboxes(d2l.plt.imshow(img).axes,
                    anchors[0] * bbox_scale)


First, let’s consider detection of small objects. In order to make it easier to distinguish 
when displayed, the anchor boxes with different centers here do not overlap: the anchor 
box scale is set to 0.15 and the height and width of the feature map are set to 4. We can see 
that the centers of the anchor boxes in 4 rows and 4 columns on the image are uniformly 
distributed. 


display_anchors(fmap_w=4, fmap_h=4, s=[0.15]) 
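
To make the uniform sampling concrete, here is a minimal sketch of where the anchor box centers land for a 4 × 4 feature map. The function `anchor_centers` is hypothetical, and it assumes each center sits in the middle of its feature map unit (an offset of half a unit), which is how the centers are taken to be placed in this sketch.

```python
def anchor_centers(fmap_h, fmap_w):
    """Relative (x, y) centers, one per feature map unit, row by row."""
    return [((col + 0.5) / fmap_w, (row + 0.5) / fmap_h)
            for row in range(fmap_h) for col in range(fmap_w)]


centers = anchor_centers(4, 4)
print(len(centers), centers[0], centers[-1])  # 16 (0.125, 0.125) (0.875, 0.875)

# Scale the relative coordinates by the image width and height (728 and 561)
first_center_px = (centers[0][0] * 728, centers[0][1] * 561)
print(first_center_px)  # (91.0, 70.125)
```

Because the centers are expressed relative to the feature map, the same 16 relative positions apply to an input image of any size.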


We move on to reduce the height and width of the feature map by half and use larger anchor 
boxes to detect larger objects. When the scale is set to 0.4, some anchor boxes will overlap 
with each other. 


display_anchors(fmap_w=2, fmap_h=2, s=[0.4])


Finally, we further reduce the height and width of the feature map by half and increase the
anchor box scale to 0.8. Now the center of the anchor box is the center of the image.


display_anchors(fmap_w=1, fmap_h=1, s=[0.8]) 


14.5.2 Multiscale Detection 


Since we have generated multiscale anchor boxes, we will use them to detect objects of 
various sizes at different scales. In the following we introduce a CNN-based multiscale 
object detection method that we will implement in Section 14.7. 


At some scale, say that we have c feature maps of shape h × w. Using the method in Section
14.5.1, we generate hw sets of anchor boxes, where each set has a anchor boxes with the
same center. For example, at the first scale in the experiments in Section 14.5.1, given ten
(number of channels) 4 × 4 feature maps, we generated 16 sets of anchor boxes, where each
set contains 3 anchor boxes with the same center. Next, each anchor box is labeled with
the class and offset based on ground-truth bounding boxes. At the current scale, the object
detection model needs to predict the classes and offsets of hw sets of anchor boxes on the
input image, where different sets have different centers.
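
As a quick check of these counts, the sketch below tallies the anchor boxes at the three scales used in Section 14.5.1. The helper `num_anchor_boxes` is hypothetical; it assumes that one scale value and three aspect ratios give 1 + 3 - 1 = 3 anchor boxes per center.

```python
def num_anchor_boxes(fmap_h, fmap_w, num_sizes, num_ratios):
    """Return (number of sets, anchor boxes per set, total) for one scale."""
    per_center = num_sizes + num_ratios - 1  # a anchor boxes share each center
    num_sets = fmap_h * fmap_w               # hw centers, one per unit
    return num_sets, per_center, num_sets * per_center


for fmap in (4, 2, 1):
    print(fmap, num_anchor_boxes(fmap, fmap, num_sizes=1, num_ratios=3))
# 4 (16, 3, 48)
# 2 (4, 3, 12)
# 1 (1, 3, 3)
```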


Assume that the c feature maps here are the intermediate outputs obtained by the CNN for- 
ward propagation based on the input image. Since there are hw different spatial positions 
on each feature map, the same spatial position can be thought of as having c units. Ac- 
cording to the definition of receptive field in Section 7.2, these c units at the same spatial 
position of the feature maps have the same receptive field on the input image: they repre- 
sent the input image information in the same receptive field. Therefore, we can transform 
the c units of the feature maps at the same spatial position into the classes and offsets of the 
a anchor boxes generated using this spatial position. In essence, we use the information of 
the input image in a certain receptive field to predict the classes and offsets of the anchor 
boxes that are close to that receptive field on the input image. 


When the feature maps at different layers have varying-size receptive fields on the input 
image, they can be used to detect objects of different sizes. For example, we can design a 
neural network where units of feature maps that are closer to the output layer have wider 
receptive fields, so they can detect larger objects from the input image. 


In a nutshell, we can leverage layerwise representations of images at multiple levels by deep
neural networks for multiscale object detection. We will show how this works through a
concrete example in Section 14.7.


14.5.3 Summary 


• At multiple scales, we can generate anchor boxes with different sizes to detect objects
  with different sizes.

• By defining the shape of feature maps, we can determine centers of uniformly sampled
  anchor boxes on any image.

• We use the information of the input image in a certain receptive field to predict the classes
  and offsets of the anchor boxes that are close to that receptive field on the input image.

• Through deep learning, we can leverage its layerwise representations of images at
  multiple levels for multiscale object detection.


14.5.4 Exercises 


1. According to our discussions in Section 8.1, deep neural networks learn hierarchical 
features with increasing levels of abstraction for images. In multiscale object detection, 
do feature maps at different scales correspond to different levels of abstraction? Why or 
why not? 


2. At the first scale (fmap_w=4, fmap_h=4) in the experiments in Section 14.5.1, generate
uniformly distributed anchor boxes that may overlap.


3. Suppose that we are given a feature map variable with shape 1 × c × h × w, where c, h,
and w are the number of channels, height, and width of the feature maps, respectively. How
can you transform this variable into the classes and offsets of anchor boxes? What is the
shape of the output?


Discussions.


14.6 The Object Detection Dataset 


There is no small dataset such as MNIST and Fashion-MNIST in the field of object detec- 
tion. In order to quickly demonstrate object detection models, we collected and labeled a 
small dataset. First, we took photos of free bananas from our office and generated 1000 
banana images with different rotations and sizes. Then we placed each banana image at a 
random position on some background image. In the end, we labeled bounding boxes for 
those bananas on the images. 


14.6.1 Downloading the Dataset 


The banana detection dataset with all the image and csv label files can be downloaded 
directly from the Internet. 


%matplotlib inline
import os
import pandas as pd
import torch
import torchvision
from d2l import torch as d2l

#@save
d2l.DATA_HUB['banana-detection'] = (
    d2l.DATA_URL + 'banana-detection.zip',
    '5de26c8fce5ccdea9f91267273464dc968d20d72')


14.6.2 Reading the Dataset 


We are going to read the banana detection dataset in the read_data_bananas function 
below. The dataset includes a csv file for object class labels and ground-truth bounding 
box coordinates at the upper-left and lower-right corners. 


#@save
def read_data_bananas(is_train=True):
    """Read the banana detection dataset images and labels."""
    data_dir = d2l.download_extract('banana-detection')
    csv_fname = os.path.join(data_dir, 'bananas_train' if is_train
                             else 'bananas_val', 'label.csv')
    csv_data = pd.read_csv(csv_fname)
    csv_data = csv_data.set_index('img_name')
    images, targets = [], []
    for img_name, target in csv_data.iterrows():
        images.append(torchvision.io.read_image(
            os.path.join(data_dir, 'bananas_train' if is_train else
                         'bananas_val', 'images', f'{img_name}')))
        # Here `target` contains (class, upper-left x, upper-left y,
        # lower-right x, lower-right y), where all the images have the same
        # banana class (index 0)
        targets.append(list(target))
    return images, torch.tensor(targets).unsqueeze(1) / 256




By using the read_data_bananas function to read images and labels, the following
BananasDataset class will allow us to create a customized Dataset instance for loading the
banana detection dataset.


#@save
class BananasDataset(torch.utils.data.Dataset):
    """A customized dataset to load the banana detection dataset."""
    def __init__(self, is_train):
        self.features, self.labels = read_data_bananas(is_train)
        print('read ' + str(len(self.features)) + (f' training examples' if
              is_train else f' validation examples'))

    def __getitem__(self, idx):
        return (self.features[idx].float(), self.labels[idx])

    def __len__(self):
        return len(self.features)


Finally, we define the load_data_bananas function to return two data iterator instances 
for both the training and test sets. For the test dataset, there is no need to read it in random 
order. 


#@save
def load_data_bananas(batch_size):
    """Load the banana detection dataset."""
    train_iter = torch.utils.data.DataLoader(BananasDataset(is_train=True),
                                             batch_size, shuffle=True)
    val_iter = torch.utils.data.DataLoader(BananasDataset(is_train=False),
                                           batch_size)
    return train_iter, val_iter


Let’s read a minibatch and print the shapes of both images and labels in this minibatch. 
The shape of the image minibatch, (batch size, number of channels, height, width), looks 
familiar: it is the same as in our earlier image classification tasks. The shape of the label 
minibatch is (batch size, m, 5), where m is the largest possible number of bounding boxes 
that any image has in the dataset. 


Although computation in minibatches is more efficient, it requires that all the image
examples contain the same number of bounding boxes to form a minibatch via concatenation.
In general, images may have a varying number of bounding boxes; thus, images with fewer 
than m bounding boxes will be padded with illegal bounding boxes until m is reached. Then 
the label of each bounding box is represented by an array of length 5. The first element in 
the array is the class of the object in the bounding box, where -1 indicates an illegal bound- 
ing box for padding. The remaining four elements of the array are the (x, y)-coordinate 
values of the upper-left corner and the lower-right corner of the bounding box (the range 
is between 0 and 1). For the banana dataset, since there is only one bounding box on each 
image, we have m = 1. 


batch_size, edge_size = 32, 256
train_iter, _ = load_data_bananas(batch_size)
batch = next(iter(train_iter))
batch[0].shape, batch[1].shape


Downloading ../data/banana-detection.zip from http://d2l-data.s3-accelerate.amazonaws.com/banana-detection.zip...
read 1000 training examples
read 100 validation examples


(torch.Size([32, 3, 256, 256]), torch.Size([32, 1, 5]))
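
The banana dataset never needs padding because m = 1, so the padding scheme described above is not exercised here. The following sketch shows what it could look like for images with different numbers of boxes. The function `pad_labels` is hypothetical, and filling the coordinates of a padded box with zeros is an arbitrary choice made for this sketch; only the class value -1 comes from the text.

```python
def pad_labels(labels_per_image):
    """Pad each image's list of 5-element labels to the same length m."""
    m = max(len(boxes) for boxes in labels_per_image)
    padded = []
    for boxes in labels_per_image:
        rows = [list(b) for b in boxes]
        # Class -1 marks an illegal bounding box that exists only as padding
        rows += [[-1.0, 0.0, 0.0, 0.0, 0.0] for _ in range(m - len(rows))]
        padded.append(rows)
    return padded


labels = [
    [[0, 0.1, 0.2, 0.4, 0.5], [1, 0.5, 0.5, 0.9, 0.8]],  # image with 2 boxes
    [[0, 0.3, 0.3, 0.6, 0.7]],                           # image with 1 box
]
padded = pad_labels(labels)
print(len(padded), len(padded[0]), len(padded[0][0]))  # 2 2 5
print(padded[1][1])  # [-1.0, 0.0, 0.0, 0.0, 0.0]
```

After padding, every image contributes an m × 5 array, so the labels stack into a tensor of shape (batch size, m, 5).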


14.6.3 Demonstration 


Let’s demonstrate ten images with their labeled ground-truth bounding boxes. We can see 
that the rotations, sizes, and positions of bananas vary across all these images. Of course, 
this is just a simple artificial dataset. In practice, real-world datasets are usually much more 
complicated. 


imgs = (batch[0][:10].permute(0, 2, 3, 1)) / 255
axes = d2l.show_images(imgs, 2, 5, scale=2)
for ax, label in zip(axes, batch[1][:10]):
    d2l.show_bboxes(ax, [label[0][1:5] * edge_size], colors=['w'])




14.6.4 Summary 


• The banana detection dataset we collected can be used to demonstrate object detection
  models.

• The data loading for object detection is similar to that for image classification. However,
  in object detection the labels also contain information of ground-truth bounding boxes,
  which is missing in image classification.


14.6.5 Exercises 




1. Demonstrate other images with ground-truth bounding boxes in the banana detection 
dataset. How do they differ with respect to bounding boxes and objects? 


2. Say that we want to apply data augmentation, such as random cropping, to object detec- 
tion. How can it be different from that in image classification? Hint: what if a cropped 
image only contains a small portion of an object? 


Discussions.


14.7 Single Shot Multibox Detection 


In Section 14.3 to Section 14.6, we introduced bounding boxes, anchor boxes, multiscale
object detection, and the dataset for object detection. Now we are ready to use such
background knowledge to design an object detection model: single shot multibox detection
(SSD) (Liu et al., 2016). This model is simple, fast, and widely used. Although this is
just one of a vast number of object detection models, some of the design principles and
implementation details in this section are also applicable to other models.


14.7.1 Model 


Fig. 14.7.1 provides an overview of the design of single-shot multibox detection. This 
model mainly consists of a base network followed by several multiscale feature map blocks. 
The base network is for extracting features from the input image, so it can use a deep CNN. 
For example, the original single-shot multibox detection paper adopts a VGG network
truncated before the classification layer (Liu et al., 2016), while ResNet has also been
commonly used. Through our design we can make the base network output larger feature maps so as
to generate more anchor boxes for detecting smaller objects. Subsequently, each multiscale 
feature map block reduces (e.g., by half) the height and width of the feature maps from the 
previous block, and enables each unit of the feature maps to increase its receptive field on 
the input image. 


Recall the design of multiscale object detection through layerwise representations of images
by deep neural networks in Section 14.5. Since multiscale feature maps closer to the top of 
Fig. 14.7.1 are smaller but have larger receptive fields, they are suitable for detecting fewer 
but larger objects. 


In a nutshell, via its base network and several multiscale feature map blocks, single-shot 
multibox detection generates a varying number of anchor boxes with different sizes, and 
detects varying-size objects by predicting classes and offsets of these anchor boxes (thus 
the bounding boxes); thus, this is a multiscale object detection model. 


Fig. 14.7.1: As a multiscale object detection model, single-shot multibox detection mainly
consists of a base network followed by several multiscale feature map blocks.


In the following, we will describe the implementation details of different blocks in Fig. 
14.7.1. To begin with, we discuss how to implement the class and bounding box predic- 
tion. 


Class Prediction Layer 


Let the number of object classes be q. Then anchor boxes have q + 1 classes, where class 0 is
background. At some scale, suppose that the height and width of feature maps are h and w,
respectively. When a anchor boxes are generated with each spatial position of these feature
maps as their center, a total of hwa anchor boxes need to be classified. This often makes
classification with fully connected layers infeasible due to likely heavy parametrization
costs. Recall how we used channels of convolutional layers to predict classes in Section
8.3. Single-shot multibox detection uses the same technique to reduce model complexity.


Specifically, the class prediction layer uses a convolutional layer without altering width
or height of feature maps. In this way, there can be a one-to-one correspondence between
outputs and inputs at the same spatial dimensions (width and height) of feature maps. More
concretely, channels of the output feature maps at any spatial position (x, y) represent class
predictions for all the anchor boxes centered on (x, y) of the input feature maps. To produce
valid predictions, there must be a(q + 1) output channels, where for the same spatial position
the output channel with index i(q + 1) + j represents the prediction of the class j (0 ≤ j ≤ q)
for the anchor box i (0 ≤ i < a).
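
The channel layout can be made concrete with a small sketch. The helper names `channel_index` and `anchor_and_class` are hypothetical and only illustrate the indexing rule i(q + 1) + j stated above.

```python
def channel_index(i, j, q):
    """Output channel holding the class-j prediction for anchor box i."""
    return i * (q + 1) + j


def anchor_and_class(channel, q):
    """Inverse mapping: recover (anchor box i, class j) from a channel index."""
    return divmod(channel, q + 1)


a, q = 5, 10  # 5 anchor boxes per position, 10 object classes plus background
print(channel_index(0, 0, q))       # 0: background score of the first anchor box
print(channel_index(4, 10, q))      # 54: last of the a * (q + 1) = 55 channels
print(anchor_and_class(54, q))      # (4, 10)
```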


Below we define such a class prediction layer, specifying a and q via arguments num_anchors 
and num_classes, respectively. This layer uses a 3 x 3 convolutional layer with a padding 
of 1. The width and height of the input and output of this convolutional layer remain un- 
changed. 


%matplotlib inline
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


def cls_predictor(num_inputs, num_anchors, num_classes):
    return nn.Conv2d(num_inputs, num_anchors * (num_classes + 1),
                     kernel_size=3, padding=1)


Bounding Box Prediction Layer 


The design of the bounding box prediction layer is similar to that of the class prediction 
layer. The only difference lies in the number of outputs for each anchor box: here we need 
to predict four offsets rather than q + 1 classes. 


def bbox_predictor(num_inputs, num_anchors):
    return nn.Conv2d(num_inputs, num_anchors * 4, kernel_size=3, padding=1)


Concatenating Predictions for Multiple Scales 


As we mentioned, single-shot multibox detection uses multiscale feature maps to generate 
anchor boxes and predict their classes and offsets. At different scales, the shapes of feature 
maps or the numbers of anchor boxes centered on the same unit may vary. Therefore, shapes 
of the prediction outputs at different scales may vary. 


In the following example, we construct feature maps at two different scales, Y1 and Y2, for 
the same minibatch, where the height and width of Y2 are half of those of Y1. Let’s take 
class prediction as an example. Suppose that 5 and 3 anchor boxes are generated for every 
unit in Y1 and Y2, respectively. Suppose further that the number of object classes is 10. 
For feature maps Y1 and Y2 the numbers of channels in the class prediction outputs are 
5 × (10 + 1) = 55 and 3 × (10 + 1) = 33, respectively, where either output shape is (batch
size, number of channels, height, width).


def forward(x, block):
    return block(x)

Y1 = forward(torch.zeros((2, 8, 20, 20)), cls_predictor(8, 5, 10))
Y2 = forward(torch.zeros((2, 16, 10, 10)), cls_predictor(16, 3, 10))
Y1.shape, Y2.shape


(torch.Size([2, 55, 20, 20]), torch.Size([2, 33, 10, 10]))


As we can see, except for the batch size dimension, the other three dimensions all have
different sizes. To concatenate these two prediction outputs for more efficient computation, 
we will transform these tensors into a more consistent format. 


Note that the channel dimension holds the predictions for anchor boxes with the same center. 
We first move this dimension to the innermost. Since the batch size remains the same for 
different scales, we can transform the prediction output into a two-dimensional tensor with 
shape (batch size, height x width x number of channels). Then we can concatenate such 
outputs at different scales along dimension 1. 


def flatten_pred(pred):
    return torch.flatten(pred.permute(0, 2, 3, 1), start_dim=1)

def concat_preds(preds):
    return torch.cat([flatten_pred(p) for p in preds], dim=1)


In this way, even though Y1 and Y2 have different sizes in channels, heights, and widths, 
we can still concatenate these two prediction outputs at two different scales for the same 
minibatch. 


concat_preds([Y1, Y2]).shape


torch.Size([2, 25300]) 
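
The length 25300 follows directly from the two prediction shapes. The arithmetic can be reproduced without PyTorch in the sketch below, where `flattened_length` is a hypothetical helper.

```python
def flattened_length(channels, height, width):
    """Per-example length after flattening a (channels, height, width) output."""
    return height * width * channels


len_y1 = flattened_length(55, 20, 20)  # 20 * 20 * 55 = 22000
len_y2 = flattened_length(33, 10, 10)  # 10 * 10 * 33 = 3300
print(len_y1 + len_y2)  # 25300
```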


Downsampling Block 


In order to detect objects at multiple scales, we define the following downsampling block
down_sample_blk that halves the height and width of input feature maps. In fact, this block
applies the design of VGG blocks in Section 8.2.1. More concretely, each downsampling
block consists of two 3 × 3 convolutional layers with padding of 1 followed by a 2 × 2
max-pooling layer with stride of 2. As we know, 3 × 3 convolutional layers with padding of 1 do
not change the shape of feature maps. However, the subsequent 2 × 2 max-pooling reduces
the height and width of input feature maps by half. For both input and output feature maps
of this downsampling block, because 1 × 2 + (3 − 1) + (3 − 1) = 6, each unit in the output
has a 6 × 6 receptive field on the input. Therefore, the downsampling block enlarges the
receptive field of each unit in its output feature maps.


def down_sample_blk(in_channels, out_channels):
    blk = []
    for _ in range(2):
        blk.append(nn.Conv2d(in_channels, out_channels,
                             kernel_size=3, padding=1))
        blk.append(nn.BatchNorm2d(out_channels))
        blk.append(nn.ReLU())
        in_channels = out_channels
    blk.append(nn.MaxPool2d(2))
    return nn.Sequential(*blk)


In the following example, our constructed downsampling block changes the number of input 
channels and halves the height and width of the input feature maps. 


forward(torch.zeros((2, 3, 20, 20)), down_sample_blk(3, 10)).shape 


torch.Size([2, 10, 10, 10]) 
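
The 6 × 6 receptive field quoted for the downsampling block can be verified by walking backward through its layers. The sketch below uses a hypothetical helper `receptive_field` and describes each layer only by its kernel size and stride; batch normalization and ReLU act elementwise, so they are left out.

```python
def receptive_field(layers):
    """Receptive field (along one axis) of one output unit.

    `layers` lists (kernel_size, stride) pairs from input to output.
    """
    size = 1
    # Walk from the output back to the input, growing the field at each layer
    for kernel, stride in reversed(layers):
        size = (size - 1) * stride + kernel
    return size


# Two 3 x 3 convolutions (stride 1) followed by 2 x 2 max-pooling (stride 2)
down_sample_layers = [(3, 1), (3, 1), (2, 2)]
print(receptive_field(down_sample_layers))  # 6
```

Stacking several such blocks keeps enlarging the receptive field, which is why deeper feature maps suit larger objects.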


Base Network Block 


The base network block is used to extract features from input images. For simplicity, we 
construct a small base network consisting of three downsampling blocks that double the 
number of channels at each block. Given a 256 × 256 input image, this base network block
outputs 32 × 32 feature maps (256/2³ = 32).


def base_net():
    blk = []
    num_filters = [3, 16, 32, 64]
    for i in range(len(num_filters) - 1):
        blk.append(down_sample_blk(num_filters[i], num_filters[i+1]))
    return nn.Sequential(*blk)


forward(torch.zeros((2, 3, 256, 256)), base_net()).shape 


torch.Size([2, 64, 32, 32]) 


The Complete Model 


The complete single shot multibox detection model consists of five blocks. The feature 
maps produced by each block are used for both (i) generating anchor boxes and (ii) predict- 
ing classes and offsets of these anchor boxes. Among these five blocks, the first one is the 
base network block, the second to the fourth are downsampling blocks, and the last block 
uses global max-pooling to reduce both the height and width to 1. Technically, the second 
to the fifth blocks are all those multiscale feature map blocks in Fig. 14.7.1. 


def get_blk(i):
    if i == 0:
        blk = base_net()
    elif i == 1:
        blk = down_sample_blk(64, 128)
    elif i == 4:
        blk = nn.AdaptiveMaxPool2d((1, 1))
    else:
        blk = down_sample_blk(128, 128)
    return blk


Now we define the forward propagation for each block. Unlike in image classification
tasks, outputs here include (i) CNN feature maps Y, (ii) anchor boxes generated using
Y at the current scale, and (iii) classes and offsets predicted (based on Y) for these anchor
boxes.


def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):
    Y = blk(X)
    anchors = d2l.multibox_prior(Y, sizes=size, ratios=ratio)
    cls_preds = cls_predictor(Y)
    bbox_preds = bbox_predictor(Y)
    return (Y, anchors, cls_preds, bbox_preds)


Recall that in Fig. 14.7.1 a multiscale feature map block that is closer to the top is for 
detecting larger objects; thus, it needs to generate larger anchor boxes. In the above forward 
propagation, at each multiscale feature map block we pass in a list of two scale values via the 
sizes argument of the invoked multibox_prior function (described in Section 14.4). In 
the following, the interval between 0.2 and 1.05 is split evenly into five sections to determine
the smaller scale values at the five blocks: 0.2, 0.37, 0.54, 0.71, and 0.88. Then their larger
scale values are given by √(0.2 × 0.37) = 0.272, √(0.37 × 0.54) = 0.447, and so on.


sizes = [[0.2, 0.272], [0.37, 0.447], [0.54, 0.619], [0.71, 0.79],
         [0.88, 0.961]]
ratios = [[1, 2, 0.5]] * 5
num_anchors = len(sizes[0]) + len(ratios[0]) - 1
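
The hard-coded scale values can be reproduced from the rule described above. The sketch below is one way to do so; the function `ssd_scales` and its parameter names are hypothetical.

```python
import math


def ssd_scales(s_min=0.2, s_max=1.05, num_blocks=5):
    """Return [smaller scale, larger scale] for each of the blocks."""
    step = (s_max - s_min) / num_blocks
    # Six evenly spaced values: five smaller scales plus the upper end 1.05
    small = [s_min + k * step for k in range(num_blocks + 1)]
    # The larger scale is the geometric mean of two consecutive smaller scales
    return [[round(small[k], 3), round(math.sqrt(small[k] * small[k + 1]), 3)]
            for k in range(num_blocks)]


print(ssd_scales())
# [[0.2, 0.272], [0.37, 0.447], [0.54, 0.619], [0.71, 0.79], [0.88, 0.961]]
```

The upper end 1.05 is never used as a smaller scale; it only enters the geometric mean for the last block.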


Now we can define the complete model TinySSD as follows. 


class TinySSD(nn.Module):
    def __init__(self, num_classes, **kwargs):
        super(TinySSD, self).__init__(**kwargs)
        self.num_classes = num_classes
        idx_to_in_channels = [64, 128, 128, 128, 128]
        for i in range(5):
            # Equivalent to the assignment statement `self.blk_i = get_blk(i)`
            setattr(self, f'blk_{i}', get_blk(i))
            setattr(self, f'cls_{i}', cls_predictor(idx_to_in_channels[i],
                                                    num_anchors, num_classes))
            setattr(self, f'bbox_{i}', bbox_predictor(idx_to_in_channels[i],
                                                      num_anchors))

    def forward(self, X):
        anchors, cls_preds, bbox_preds = [None] * 5, [None] * 5, [None] * 5
        for i in range(5):
            # Here `getattr(self, 'blk_%d' % i)` accesses `self.blk_i`
            X, anchors[i], cls_preds[i], bbox_preds[i] = blk_forward(
                X, getattr(self, f'blk_{i}'), sizes[i], ratios[i],
                getattr(self, f'cls_{i}'), getattr(self, f'bbox_{i}'))
        anchors = torch.cat(anchors, dim=1)
        cls_preds = concat_preds(cls_preds)
        cls_preds = cls_preds.reshape(
            cls_preds.shape[0], -1, self.num_classes + 1)
        bbox_preds = concat_preds(bbox_preds)
        return anchors, cls_preds, bbox_preds


We create a model instance and use it to perform forward propagation on a minibatch of 
256 x 256 images X. 


As shown earlier in this section, the first block outputs 32 x 32 feature maps. Recall that the 
second to fourth downsampling blocks halve the height and width and the fifth block uses 
global pooling. Since 4 anchor boxes are generated for each unit along spatial dimensions 
of feature maps, at all the five scales a total of (32² + 16² + 8² + 4² + 1) × 4 = 5444 anchor
boxes are generated for each image.


net = TinySSD(num_classes=1)
X = torch.zeros((32, 3, 256, 256))
anchors, cls_preds, bbox_preds = net(X)

print('output anchors:', anchors.shape)
print('output class preds:', cls_preds.shape)
print('output bbox preds:', bbox_preds.shape)


output anchors: torch.Size([1, 5444, 4]) 
output class preds: torch.Size([32, 5444, 2]) 
output bbox preds: torch.Size([32, 21776]) 
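
The numbers in these shapes can be derived by hand. The sketch below recomputes them from the block structure; `feature_map_sizes` is a hypothetical helper that assumes a square input, three halvings in the base network, three further downsampling blocks, and a final global pooling to 1 × 1.

```python
def feature_map_sizes(input_size=256, base_halvings=3, num_down_blocks=3):
    """Side length of the square feature map produced by each of the blocks."""
    size = input_size // 2 ** base_halvings  # base network: 256 / 2^3 = 32
    sizes = [size]
    for _ in range(num_down_blocks):         # each downsampling block halves it
        size //= 2
        sizes.append(size)
    sizes.append(1)                          # global max-pooling
    return sizes


fmap_sizes = feature_map_sizes()
anchors_per_unit = 4
total_anchors = sum(s * s for s in fmap_sizes) * anchors_per_unit
print(fmap_sizes)         # [32, 16, 8, 4, 1]
print(total_anchors)      # 5444
print(total_anchors * 4)  # 21776 offsets per image, four per anchor box
```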


14.7.2 Training 


Now we will explain how to train the single shot multibox detection model for object de- 
tection. 


Reading the Dataset and Initializing the Model 


To begin with, let’s read the banana detection dataset described in Section 14.6. 


batch_size = 32
train_iter, _ = d2l.load_data_bananas(batch_size)


read 1000 training examples
read 100 validation examples


There is only one class in the banana detection dataset. After defining the model, we need 
to initialize its parameters and define the optimization algorithm. 


device, net = d2l.try_gpu(), TinySSD(num_classes=1)
trainer = torch.optim.SGD(net.parameters(), lr=0.2, weight_decay=5e-4)


Defining Loss and Evaluation Functions 


Object detection has two types of losses. The first loss concerns classes of anchor boxes:
its computation can simply reuse the cross-entropy loss function that we used for image
classification. The second loss concerns offsets of positive (non-background) anchor boxes:
this is a regression problem. For this regression problem, however, here we do not use the
squared loss described in Section 3.1.3. Instead, we use the $\ell_1$ norm loss, the absolute
value of the difference between the prediction and the ground-truth. The mask variable
bbox_masks filters out negative anchor boxes and illegal (padded) anchor boxes in the loss
calculation. In the end, we sum up the anchor box class loss and the anchor box offset loss
to obtain the loss function for the model.


cls_loss = nn.CrossEntropyLoss(reduction='none')
bbox_loss = nn.L1Loss(reduction='none')

def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
    batch_size, num_classes = cls_preds.shape[0], cls_preds.shape[2]
    cls = cls_loss(cls_preds.reshape(-1, num_classes),
                   cls_labels.reshape(-1)).reshape(batch_size, -1).mean(dim=1)
    bbox = bbox_loss(bbox_preds * bbox_masks,
                     bbox_labels * bbox_masks).mean(dim=1)
    return cls + bbox
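To see how the mask enters the offset loss, the following minimal sketch evaluates the same two loss terms on toy tensors. The sizes, the random inputs, and the way the mask is built from the class labels (four identical entries per anchor box) are assumptions made for illustration only.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, num_anchors, num_classes = 2, 3, 2  # toy sizes (assumed)

cls_preds = torch.randn(batch_size, num_anchors, num_classes)
cls_labels = torch.tensor([[0, 1, 0], [0, 0, 1]])
bbox_preds = torch.randn(batch_size, num_anchors * 4)
bbox_labels = torch.randn(batch_size, num_anchors * 4)
# Assumed mask layout: 1 for the four offsets of a positive anchor box,
# 0 for the four offsets of a background anchor box
bbox_masks = (cls_labels > 0).float().repeat_interleave(4, dim=1)

cls_loss = nn.CrossEntropyLoss(reduction='none')
bbox_loss = nn.L1Loss(reduction='none')

# Per-example class loss, averaged over anchor boxes
cls = cls_loss(cls_preds.reshape(-1, num_classes),
               cls_labels.reshape(-1)).reshape(batch_size, -1).mean(dim=1)
# Per-example offset loss; masked-out entries contribute zero
bbox = bbox_loss(bbox_preds * bbox_masks,
                 bbox_labels * bbox_masks).mean(dim=1)
loss = cls + bbox

# Equivalent form: mask the absolute error directly
manual_bbox = ((bbox_preds - bbox_labels).abs() * bbox_masks).mean(dim=1)
print(loss.shape)
print(torch.allclose(bbox, manual_bbox))
```

Because the mask only takes the values 0 and 1, multiplying both the predictions and the labels by it is the same as masking the absolute error.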


We can use accuracy to evaluate the classification results. Because we use the $\ell_1$ norm
loss for the offsets, we use the mean absolute error to evaluate the predicted bounding boxes.
These prediction results are obtained from the generated anchor boxes and the predicted
offsets for them.


def cls_eval(cls_preds, cls_labels):
    # Because the class prediction results are on the final dimension,
    # `argmax` needs to specify this dimension
    return float((cls_preds.argmax(dim=-1).type(
        cls_labels.dtype) == cls_labels).sum())

def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
    return float((torch.abs((bbox_labels - bbox_preds) * bbox_masks)).sum())
Training the Model


When training the model, we need to generate multiscale anchor boxes (anchors) and predict
their classes (cls_preds) and offsets (bbox_preds) in the forward propagation. Then
we label the classes (cls_labels) and offsets (bbox_labels) of these generated anchor
boxes based on the label information Y. Finally, we calculate the loss function using the
predicted and labeled values of the classes and offsets. To keep the implementation concise,
evaluation on the test dataset is omitted here.


num_epochs, timer = 20, d2l.Timer()
animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                        legend=['class error', 'bbox mae'])
net = net.to(device)
for epoch in range(num_epochs):
    # Sum of training accuracy, no. of examples in sum of training accuracy,
    # Sum of absolute error, no. of examples in sum of absolute error
    metric = d2l.Accumulator(4)
    net.train()
    for features, target in train_iter:
        timer.start()
        trainer.zero_grad()
        X, Y = features.to(device), target.to(device)
        # Generate multiscale anchor boxes and predict their classes and
        # offsets
        anchors, cls_preds, bbox_preds = net(X)
        # Label the classes and offsets of these anchor boxes
        bbox_labels, bbox_masks, cls_labels = d2l.multibox_target(anchors, Y)
        # Calculate the loss function using the predicted and labeled values
        # of the classes and offsets
        l = calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels,
                      bbox_masks)
        l.mean().backward()
        trainer.step()
        metric.add(cls_eval(cls_preds, cls_labels), cls_labels.numel(),
                   bbox_eval(bbox_preds, bbox_labels, bbox_masks),
                   bbox_labels.numel())
    cls_err, bbox_mae = 1 - metric[0] / metric[1], metric[2] / metric[3]
    animator.add(epoch + 1, (cls_err, bbox_mae))
print(f'class err {cls_err:.2e}, bbox mae {bbox_mae:.2e}')
print(f'{len(train_iter.dataset) / timer.stop():.1f} examples/sec on '
      f'{str(device)}')


class err 3.27e-03, bbox mae 3.08e-03
4279.7 examples/sec on cuda:0


14.7.3 Prediction 


During prediction, the goal is to detect all the objects of interest on the image. Below we 
read and resize a test image, converting it to a four-dimensional tensor that is required by 
convolutional layers. 
[Figure: training curves of class error and bbox mae plotted against the epoch.]


X = torchvision.io.read_image('../img/banana.jpg').unsqueeze(0).float()
img = X.squeeze(0).permute(1, 2, 0).long()


Using the multibox_detection function below, the predicted bounding boxes are obtained
from the anchor boxes and their predicted offsets. Then non-maximum suppression is used
to remove similar predicted bounding boxes.


def predict(X):
    net.eval()
    anchors, cls_preds, bbox_preds = net(X.to(device))
    cls_probs = F.softmax(cls_preds, dim=2).permute(0, 2, 1)
    output = d2l.multibox_detection(cls_probs, bbox_preds, anchors)
    idx = [i for i, row in enumerate(output[0]) if row[0] != -1]
    return output[0, idx]


output = predict(X) 


Finally, we display all the predicted bounding boxes with confidence 0.9 or above as output.


def display(img, output, threshold):
    d2l.set_figsize((5, 5))
    fig = d2l.plt.imshow(img)
    for row in output:
        score = float(row[1])
        if score < threshold:
            continue
        h, w = img.shape[:2]
        bbox = [row[2:6] * torch.tensor((w, h, w, h), device=row.device)]
        d2l.show_bboxes(fig.axes, bbox, '%.2f' % score, 'w')

display(img, output.cpu(), threshold=0.9)


14.7.4 Summary 


• Single shot multibox detection is a multiscale object detection model. Via its base
  network and several multiscale feature map blocks, single-shot multibox detection
  generates a varying number of anchor boxes with different sizes, and detects varying-size
  objects by predicting classes and offsets of these anchor boxes (thus the bounding
  boxes).


• When training the single-shot multibox detection model, the loss function is calculated
  based on the predicted and labeled values of the anchor box classes and offsets.


14.7.5 Exercises 


1. Can you improve the single-shot multibox detection by improving the loss function? For
example, replace the $\ell_1$ norm loss with the smooth $\ell_1$ norm loss for the predicted
offsets. This loss function uses a square function around zero for smoothness, which is
controlled by the hyperparameter $\sigma$:


$$f(x) = \begin{cases} (\sigma x)^2/2, & \text{if } |x| < 1/\sigma^2 \\ |x| - 0.5/\sigma^2, & \text{otherwise} \end{cases} \qquad (14.7.1)$$


When $\sigma$ is very large, this loss is similar to the $\ell_1$ norm loss. When its value is
smaller, the loss function is smoother.


def smooth_l1(data, scalar):
    out = []
    for i in data:
        if abs(i) < 1 / (scalar ** 2):
            out.append(((scalar * i) ** 2) / 2)
        else:
            out.append(abs(i) - 0.5 / (scalar ** 2))
    return torch.tensor(out)

sigmas = [10, 1, 0.5]
lines = ['-', '--', '-.']
x = torch.arange(-2, 2, 0.1)
d2l.set_figsize()

for l, s in zip(lines, sigmas):
    y = smooth_l1(x, scalar=s)
    d2l.plt.plot(x, y, l, label='sigma=%.1f' % s)
d2l.plt.legend();


[Figure: smooth $\ell_1$ loss plotted for sigma = 10.0, 1.0, and 0.5.]
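The piecewise definition of the smooth $\ell_1$ loss can also be evaluated without a Python loop. The sketch below is one possible vectorized form using `torch.where`; the function name `smooth_l1_vectorized` and the test points are our own choices, not part of the d2l library.

```python
import torch

def smooth_l1_vectorized(x, sigma):
    """Smooth L1 loss evaluated elementwise with torch.where."""
    threshold = 1.0 / sigma ** 2
    quadratic = (sigma * x) ** 2 / 2       # used where |x| < 1 / sigma^2
    linear = x.abs() - 0.5 / sigma ** 2    # used everywhere else
    return torch.where(x.abs() < threshold, quadratic, linear)

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(smooth_l1_vectorized(x, sigma=1.0))
print(smooth_l1_vectorized(x, sigma=10.0))
```

With a large value such as 10, the quadratic region shrinks to |x| < 0.01, so the values are within 0.005 of |x|, which matches the remark that the loss then resembles the $\ell_1$ norm loss.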


Besides, in the experiment we used cross-entropy loss for class prediction: denoting by $p_j$
the predicted probability for the ground-truth class $j$, the cross-entropy loss is $-\log p_j$.
We can also use the focal loss (Lin et al., 2017): given hyperparameters $\gamma > 0$ and
$\alpha > 0$, this loss is defined as:


$$-\alpha (1 - p_j)^{\gamma} \log p_j. \qquad (14.7.2)$$


As we can see, increasing $\gamma$ can effectively reduce the relative loss for well-classified
examples (e.g., $p_j > 0.5$) so the training can focus more on those difficult examples that
are misclassified.


def focal_loss(gamma, x):
    return -(1 - x) ** gamma * torch.log(x)

x = torch.arange(0.01, 1, 0.01)
for l, gamma in zip(lines, [0, 1, 5]):
    y = d2l.plt.plot(x, focal_loss(gamma, x), l, label='gamma=%.1f' % gamma)
d2l.plt.legend();


[Figure: focal loss plotted for gamma = 0.0, 1.0, and 5.0.]
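A small numerical comparison makes the reweighting effect concrete. The sketch below compares the loss of a hard example with that of an easy one for several values of $\gamma$; the two probabilities (0.1 and 0.9) and the optional `alpha` argument with default 1 are assumptions chosen for illustration.

```python
import torch

def focal_loss(gamma, p, alpha=1.0):
    """Focal loss -alpha * (1 - p)^gamma * log(p) for ground-truth prob p."""
    return -alpha * (1 - p) ** gamma * torch.log(p)

p = torch.tensor([0.1, 0.9])  # a hard and an easy example (assumed values)
for gamma in [0.0, 1.0, 5.0]:
    hard, easy = focal_loss(gamma, p)
    print(f'gamma={gamma}: hard/easy loss ratio = {float(hard / easy):.1f}')
```

With gamma equal to 0 the focal loss reduces to the cross-entropy loss, and the ratio grows as gamma increases, so the hard example dominates the total loss more strongly.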


2. Due to space limitations, we have omitted some implementation details of the single
shot multibox detection model in this section. Can you further improve the model in the
following aspects:


1. When an object is much smaller than the image, the model could enlarge the input
image.


2. There are typically a vast number of negative anchor boxes. To make the class dis- 
tribution more balanced, we could downsample negative anchor boxes. 


3. In the loss function, assign different weight hyperparameters to the class loss and the 
offset loss. 


4. Use other methods to evaluate the object detection model, such as those in the single 
shot multibox detection paper (Liu et al., 2016). 


Discussions.


14.8 Region-based CNNs (R-CNNs) 


Besides single shot multibox detection described in Section 14.7, region-based CNNs or
regions with CNN features (R-CNNs) are also among the pioneering approaches to applying
deep learning to object detection (Girshick et al., 2014). In this section, we will
introduce the R-CNN and its series of improvements: the fast R-CNN (Girshick, 2015), the
faster R-CNN (Ren et al., 2015), and the mask R-CNN (He et al., 2017). Due to limited
space, we will only focus on the design of these models.


14.8.1 R-CNNs 


The R-CNN first extracts many (e.g., 2000) region proposals from the input image (e.g.,
anchor boxes can also be considered as region proposals), labeling their classes and bounding
boxes (e.g., offsets) (Girshick et al., 2014). Then a CNN is used to perform forward
propagation on each region proposal to extract its features. Next, features of each region
proposal are used for predicting the class and bounding box of this region proposal.


[Figure 14.8.1: The R-CNN model. Diagram labels: selective search, class prediction, bounding box prediction.]


Fig. 14.8.1 shows the R-CNN model. More concretely, the R-CNN consists of the following 
four steps: 
1. Perform selective search to extract multiple high-quality region proposals on the input
image (Uijlings et al., 2013). These proposed regions are usually selected at multiple 
scales with different shapes and sizes. Each region proposal will be labeled with a class 
and a ground-truth bounding box. 


2. Choose a pretrained CNN and truncate it before the output layer. Resize each region 
proposal to the input size required by the network, and output the extracted features for 
the region proposal through forward propagation. 


3. Take the extracted features and labeled class of each region proposal as an example. 
Train multiple support vector machines to classify objects, where each support vector 
machine individually determines whether the example contains a specific class. 


4. Take the extracted features and labeled bounding box of each region proposal as an 
example. Train a linear regression model to predict the ground-truth bounding box. 


Although the R-CNN model uses pretrained CNNs to effectively extract image features, it 
is slow. Imagine that we select thousands of region proposals from a single input image: 
this requires thousands of CNN forward propagations to perform object detection. This 
massive computing load makes it infeasible to widely use R-CNNs in real-world applica- 
tions. 


14.8.2 Fast R-CNN 


The main performance bottleneck of an R-CNN lies in the independent CNN forward
propagation for each region proposal, without sharing computation. Since these regions usually
overlap, independent feature extraction leads to much repeated computation. One
of the major improvements of the fast R-CNN over the R-CNN is that the CNN forward
propagation is performed only once, on the entire image (Girshick, 2015).


[Figure 14.8.2: The fast R-CNN model. Diagram labels: selective search, RoI pooling, class prediction, bounding box prediction.]


Fig. 14.8.2 describes the fast R-CNN model. Its major computations are as follows: 


1. Compared with the R-CNN, in the fast R-CNN the input of the CNN for feature extraction
is the entire image, rather than individual region proposals. Moreover, this CNN is
trainable. Given an input image, let the shape of the CNN output be
$1 \times c \times h_1 \times w_1$.


2. Suppose that selective search generates $n$ region proposals. These region proposals (of
different shapes) mark regions of interest (of different shapes) on the CNN output. Features
of the same shape (say height $h_2$ and width $w_2$ are specified) are then extracted from
these regions of interest so that they can be easily concatenated. To achieve this, the fast
R-CNN introduces the region of interest (RoI) pooling layer: the CNN output and region
proposals are input into this layer, which outputs concatenated features of shape
$n \times c \times h_2 \times w_2$ extracted for all the region proposals.


3. Using a fully connected layer, transform the concatenated features into an output of
shape $n \times d$, where $d$ depends on the model design.


4. Predict the class and bounding box for each of the $n$ region proposals. More concretely,
in class and bounding box prediction, transform the fully connected layer output into
an output of shape $n \times q$ ($q$ is the number of classes) and an output of shape
$n \times 4$, respectively. The class prediction uses softmax regression.
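The shapes in steps 3 and 4 can be traced with a few layers. The following is a minimal sketch, not the original fast R-CNN implementation: the toy sizes, the single hidden layer with a ReLU, and the names `fc`, `cls_head`, and `bbox_head` are all assumptions, and a random tensor stands in for the RoI pooling output.

```python
import torch
from torch import nn

n, c, h2, w2 = 3, 8, 2, 2   # toy sizes (assumed)
d, q = 16, 5                # hidden width and number of classes (assumed)

roi_features = torch.randn(n, c, h2, w2)  # stand-in for RoI pooling output

# Step 3: fully connected layer mapping each region's features to length d
fc = nn.Sequential(nn.Flatten(), nn.Linear(c * h2 * w2, d), nn.ReLU())
# Step 4: class scores (passed through softmax) and bounding box outputs
cls_head = nn.Linear(d, q)
bbox_head = nn.Linear(d, 4)

hidden = fc(roi_features)
cls_probs = torch.softmax(cls_head(hidden), dim=1)
bbox_preds = bbox_head(hidden)
print(hidden.shape, cls_probs.shape, bbox_preds.shape)
```

Each of the n region proposals thus yields one probability vector over the q classes and one 4-dimensional bounding box output.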


The region of interest pooling layer proposed in the fast R-CNN is different from the pooling 
layer introduced in Section 7.5. In the pooling layer, we indirectly control the output shape 
by specifying sizes of the pooling window, padding, and stride. In contrast, we can directly 
specify the output shape in the region of interest pooling layer. 


For example, let’s specify the output height and width for each region as $h_2$ and $w_2$,
respectively. For any region of interest window of shape $h \times w$, this window is divided
into a $h_2 \times w_2$ grid of subwindows, where the shape of each subwindow is approximately
$(h/h_2) \times (w/w_2)$. In practice, the height and width of any subwindow shall be rounded up,
and the largest element shall be used as output of the subwindow. Therefore, the region of
interest pooling layer can extract features of the same shape even when regions of interest
have different shapes.


As an illustrative example, in Fig. 14.8.3, the upper-left 3 x 3 region of interest is selected
on a 4 x 4 input. For this region of interest, we use a 2 x 2 region of interest pooling layer to
obtain a 2 x 2 output. Note that each of the four divided subwindows contains elements 0,
1, 4, and 5 (5 is the maximum); 2 and 6 (6 is the maximum); 8 and 9 (9 is the maximum);
and 10.


[Figure 14.8.3: A 2 x 2 region of interest pooling layer.]


Below we demonstrate the computation of the region of interest pooling layer. Suppose 
that the height and width of the CNN-extracted features X are both 4, and there is only a 
single channel. 


import torch
import torchvision

X = torch.arange(16.).reshape(1, 1, 4, 4)
X


tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]]]])


Let’s further suppose that the height and width of the input image are both 40 pixels and
that selective search generates two region proposals on this image. Each region proposal is
expressed as five elements: the index of the image in the minibatch that it belongs to (here
0), followed by the (x, y)-coordinates of its upper-left and lower-right corners.


rois = torch.Tensor([[0, 0, 0, 20, 20], [0, 0, 10, 30, 30]])


Because the height and width of X are 1/10 of the height and width of the input image,
the coordinates of the two region proposals are multiplied by 0.1 according to the specified
spatial_scale argument. Then the two regions of interest are marked on X as X[:, :,
0:3, 0:3] and X[:, :, 1:4, 0:4], respectively. Finally in the 2 x 2 region of interest
pooling, each region of interest is divided into a grid of sub-windows to further extract
features of the same shape 2 x 2.


torchvision.ops.roi_pool(X, rois, output_size=(2, 2), spatial_scale=0.1)


tensor([[[[ 5.,  6.],
          [ 9., 10.]]],


        [[[ 9., 11.],
          [13., 15.]]]])
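To see where such subwindow maxima come from, region of interest max pooling can be written out by hand for a single channel. The sketch below is an illustration only: the function name `roi_max_pool_2d` is hypothetical, and the floor/ceil rule for subwindow boundaries is one common convention that we assume here. Under this rule neighboring subwindows may share a row or column, unlike the non-overlapping split in the illustrative example, although the maxima coincide for these two regions.

```python
import torch

def roi_max_pool_2d(region, out_h, out_w):
    """Max-pool a 2D region into an out_h x out_w grid of subwindows.

    Assumed convention: subwindow i covers rows floor(i * h / out_h) up to
    (but excluding) ceil((i + 1) * h / out_h), and likewise for columns.
    """
    h, w = region.shape
    out = torch.empty(out_h, out_w, dtype=region.dtype)
    for i in range(out_h):
        r0 = (i * h) // out_h                      # floor
        r1 = ((i + 1) * h + out_h - 1) // out_h    # ceil
        for j in range(out_w):
            c0 = (j * w) // out_w
            c1 = ((j + 1) * w + out_w - 1) // out_w
            out[i, j] = region[r0:r1, c0:c1].max()
    return out

X = torch.arange(16.).reshape(4, 4)
# The two regions of interest marked on the 4 x 4 feature map
print(roi_max_pool_2d(X[0:3, 0:3], 2, 2))
print(roi_max_pool_2d(X[1:4, 0:4], 2, 2))
```

For these two regions the sketch yields [[5, 6], [9, 10]] and [[9, 11], [13, 15]], respectively.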


14.8.3 Faster R-CNN 


To be more accurate in object detection, the fast R-CNN model usually has to generate 
a lot of region proposals in selective search. To reduce region proposals without loss of 
accuracy, the faster R-CNN proposes to replace selective search with a region proposal 
network (Ren et al., 2015). 


Fig. 14.8.4 shows the faster R-CNN model. Compared with the fast R-CNN, the faster
R-CNN only changes the region proposal method from selective search to a region proposal
network. The rest of the model remains unchanged. The region proposal network works in
the following steps:


1. Use a 3 x 3 convolutional layer with padding of 1 to transform the CNN output to a
new output with c channels. In this way, each unit along the spatial dimensions of the
CNN-extracted feature maps gets a new feature vector of length c.


2. Centered on each pixel of the feature maps, generate multiple anchor boxes of different 
scales and aspect ratios and label them. 


3. Using the length-c feature vector at the center of each anchor box, predict the binary 
class (background or objects) and bounding box for this anchor box. 


4. Consider those predicted bounding boxes whose predicted classes are objects. Remove 
overlapped results using non-maximum suppression. The remaining predicted bounding 
boxes for objects are the region proposals required by the region of interest pooling layer. 
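Steps 1 and 3 above can be sketched as a small PyTorch module. This is a hypothetical illustration rather than the original faster R-CNN code: the class name `RPNHeadSketch`, the ReLU, and the use of 1 x 1 convolutions for the two predictions are assumptions, and anchor box generation, labeling, and non-maximum suppression (steps 2 and 4) are omitted.

```python
import torch
from torch import nn

class RPNHeadSketch(nn.Module):
    """Hypothetical minimal region proposal network head."""
    def __init__(self, in_channels, c, num_anchors):
        super().__init__()
        # Step 1: 3 x 3 convolution with padding 1 keeps height and width
        self.conv = nn.Conv2d(in_channels, c, kernel_size=3, padding=1)
        # Step 3: per-anchor binary class scores and bounding box offsets,
        # predicted from the length-c feature vector at each spatial unit
        self.cls = nn.Conv2d(c, num_anchors * 2, kernel_size=1)
        self.bbox = nn.Conv2d(c, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))
        return self.cls(h), self.bbox(h)

head = RPNHeadSketch(in_channels=16, c=32, num_anchors=3)
cls_scores, bbox_offsets = head(torch.zeros(1, 16, 10, 12))
print(cls_scores.shape, bbox_offsets.shape)
```

The spatial size of both outputs equals that of the input feature map, so every spatial unit carries two class scores and four offsets for each of its anchor boxes.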


It is worth noting that, as part of the faster R-CNN model, the region proposal network 
is jointly trained with the rest of the model. In other words, the objective function of the 
faster R-CNN includes not only the class and bounding box prediction in object detection, 
but also the binary class and bounding box prediction of anchor boxes in the region proposal 
network. As a result of the end-to-end training, the region proposal network learns how to 
generate high-quality region proposals, so as to stay accurate in object detection with a 
reduced number of region proposals that are learned from data. 


14.8.4 Mask R-CNN 


In the training dataset, if pixel-level positions of objects are also labeled on images, the
mask R-CNN can effectively leverage such detailed labels to further improve the accuracy
of object detection (He et al., 2017).


As shown in Fig. 14.8.5, the mask R-CNN is modified based on the faster R-CNN.
Specifically, the mask R-CNN replaces the region of interest pooling layer with the region of
interest (RoI) alignment layer. This region of interest alignment layer uses bilinear
interpolation to preserve the spatial information on the feature maps, which is more suitable for
pixel-level prediction. The output of this layer contains feature maps of the same shape for
all the regions of interest. They are used to predict not only the class and bounding box
for each region of interest, but also the pixel-level position of the object through an
additional fully convolutional network. More details on using a fully convolutional network to
predict pixel-level semantics of an image will be provided in subsequent sections of this
chapter.
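The bilinear interpolation that the region of interest alignment layer relies on can be illustrated on a single-channel feature map. The sketch below shows only this interpolation step, not a full RoI alignment layer; the function name `bilinear_sample` is hypothetical, and the coordinates are assumed to lie inside the feature map.

```python
import math
import torch

def bilinear_sample(feature_map, y, x):
    """Bilinearly interpolate a 2D feature map at fractional (y, x).

    Assumes 0 <= y <= h - 1 and 0 <= x <= w - 1.
    """
    h, w = feature_map.shape
    y0, x0 = math.floor(y), math.floor(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0  # fractional distances to the upper-left unit
    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x1]
    bottom = (1 - wx) * feature_map[y1, x0] + wx * feature_map[y1, x1]
    return (1 - wy) * top + wy * bottom

X = torch.arange(16.).reshape(4, 4)
print(bilinear_sample(X, 0.5, 0.5))  # average of elements 0, 1, 4, and 5
print(bilinear_sample(X, 1.0, 2.0))  # falls exactly on element 6
```

Unlike max pooling over rounded subwindows, the sampled value changes smoothly with the fractional position, which is why this operation retains more spatial information.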




14.8.5 Summary 


• The R-CNN extracts many region proposals from the input image, uses a CNN to perform
  forward propagation on each region proposal to extract its features, then uses these
  features to predict the class and bounding box of this region proposal.


• One of the major improvements of the fast R-CNN over the R-CNN is that the CNN
  forward propagation is only performed on the entire image. It also introduces the region
  of interest pooling layer, so that features of the same shape can be further extracted
  for regions of interest that have different shapes.


• The faster R-CNN replaces the selective search used in the fast R-CNN with a jointly
  trained region proposal network, so that the former can stay accurate in object
  detection with a reduced number of region proposals.


• Based on the faster R-CNN, the mask R-CNN additionally introduces a fully
  convolutional network, so as to leverage pixel-level labels to further improve the accuracy of
  object detection.


14.8.6 Exercises 


1. Can we frame object detection as a single regression problem, such as predicting
bounding boxes and class probabilities? You may refer to the design of the YOLO model
(Redmon et al., 2016).


2. Compare single shot multibox detection with the methods introduced in this section.
What are their major differences? You may refer to Figure 2 of Zhao et al. (2019).


14.9 Semantic Segmentation and the Dataset


When discussing object detection tasks in Section 14.3 through Section 14.8, rectangular
bounding boxes are used to label and predict objects in images. This section will discuss the
problem of semantic segmentation, which focuses on how to divide an image into regions
belonging to different semantic classes. Different from object detection, semantic
segmentation recognizes and understands what is in images at the pixel level: its labeling and
prediction of semantic regions are at the pixel level. Fig. 14.9.1 shows the labels of the dog,
cat, and background of the image in semantic segmentation. Compared with object
detection, the pixel-level borders labeled in semantic segmentation are obviously more
fine-grained.


[Figure 14.9.1: Labels of the dog, cat, and background of the image in semantic segmentation.]


14.9.1 Image Segmentation and Instance Segmentation 


There are also two important tasks in the field of computer vision that are similar to seman- 
tic segmentation, namely image segmentation and instance segmentation. We will briefly 
distinguish them from semantic segmentation as follows. 


• Image segmentation divides an image into several constituent regions. The methods for
this type of problem usually make use of the correlation between pixels in the image. 
It does not need label information about image pixels during training, and it cannot 
guarantee that the segmented regions will have the semantics that we hope to obtain 
during prediction. Taking the image in Fig. 14.9.1 as input, image segmentation may 
divide the dog into two regions: one covers the mouth and eyes which are mainly 
black, and the other covers the rest of the body which is mainly yellow. 


• Instance segmentation is also called simultaneous detection and segmentation. It studies
how to recognize the pixel-level regions of each object instance in an image. Different
from semantic segmentation, instance segmentation needs to distinguish not only
semantics, but also different object instances. For example, if there are two dogs in 
the image, instance segmentation needs to distinguish which of the two dogs a pixel 
belongs to. 


14.9.2 The Pascal VOC2012 Semantic Segmentation Dataset 


One of the most important semantic segmentation datasets is Pascal VOC2012. In the
following, we will take a look at this dataset.


%matplotlib inline
import os
import torch
import torchvision
from d2l import torch as d2l


The tar file of the dataset is about 2 GB, so it may take a while to download the file. The
extracted dataset is located at ../data/VOCdevkit/VOC2012.


#@save
d2l.DATA_HUB['voc2012'] = (d2l.DATA_URL + 'VOCtrainval_11-May-2012.tar',
                           '4e443f8a2eca6b1dac8a6c57641b67dd40621a49')

voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')


Downloading ../data/VOCtrainval_11-May-2012.tar from http://d2l-data.s3-accelerate.amazonaws.com/VOCtrainval_11-May-2012.tar...


After entering the path ../data/VOCdevkit/VOC2012, we can see the different components
of the dataset. The ImageSets/Segmentation path contains text files that specify
training and test samples, while the JPEGImages and SegmentationClass paths store the
input image and label for each example, respectively. The label here is also in the image
format, with the same size as its labeled input image. Besides, pixels with the same
color in any label image belong to the same semantic class. The following defines the
read_voc_images function to read all the input images and labels into the memory.


#@save
def read_voc_images(voc_dir, is_train=True):
    """Read all VOC feature and label images."""
    txt_fname = os.path.join(voc_dir, 'ImageSets', 'Segmentation',
                             'train.txt' if is_train else 'val.txt')
    mode = torchvision.io.image.ImageReadMode.RGB
    with open(txt_fname, 'r') as f:
        images = f.read().split()
    features, labels = [], []
    for i, fname in enumerate(images):
        features.append(torchvision.io.read_image(os.path.join(
            voc_dir, 'JPEGImages', f'{fname}.jpg')))
        labels.append(torchvision.io.read_image(os.path.join(
            voc_dir, 'SegmentationClass', f'{fname}.png'), mode))
    return features, labels

train_features, train_labels = read_voc_images(voc_dir, True)


We draw the first five input images and their labels. In the label images, white and black rep- 
resent borders and background, respectively, while the other colors correspond to different 
classes. 
n = 5
imgs = train_features[:n] + train_labels[:n]
imgs = [img.permute(1, 2, 0) for img in imgs]
d2l.show_images(imgs, 2, n);


Next, we enumerate the RGB color values and class names for all the labels in this dataset. 


#@save
VOC_COLORMAP = [[0, 0, 0], [128, 0, 0], [0, 128, 0], [128, 128, 0],
                [0, 0, 128], [128, 0, 128], [0, 128, 128], [128, 128, 128],
                [64, 0, 0], [192, 0, 0], [64, 128, 0], [192, 128, 0],
                [64, 0, 128], [192, 0, 128], [64, 128, 128], [192, 128, 128],
                [0, 64, 0], [128, 64, 0], [0, 192, 0], [128, 192, 0],
                [0, 64, 128]]

#@save
VOC_CLASSES = ['background', 'aeroplane', 'bicycle', 'bird', 'boat',
               'bottle', 'bus', 'car', 'cat', 'chair', 'cow',
               'diningtable', 'dog', 'horse', 'motorbike', 'person',
               'potted plant', 'sheep', 'sofa', 'train', 'tv/monitor']


With the two constants defined above, we can conveniently find the class index for each
pixel in a label. We define the voc_colormap2label function to build the mapping from
the above RGB color values to class indices, and the voc_label_indices function to map
any RGB values to their class indices in this Pascal VOC2012 dataset.


#@save
def voc_colormap2label():
    """Build the mapping from RGB to class indices for VOC labels."""
    colormap2label = torch.zeros(256 ** 3, dtype=torch.long)
    for i, colormap in enumerate(VOC_COLORMAP):
        colormap2label[
            (colormap[0] * 256 + colormap[1]) * 256 + colormap[2]] = i
    return colormap2label

#@save
def voc_label_indices(colormap, colormap2label):
    """Map any RGB values in VOC labels to their class indices."""
    colormap = colormap.permute(1, 2, 0).numpy().astype('int32')
    idx = ((colormap[:, :, 0] * 256 + colormap[:, :, 1]) * 256
           + colormap[:, :, 2])
    return colormap2label[idx]


For example, in the first example image, the class index for the front part of the airplane is 
1, while the background index is 0. 


y = voc_label_indices(train_labels[0], voc_colormap2label())
y[105:115, 130:140], VOC_CLASSES[1]


(tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]]),
 'aeroplane')
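The lookup idea is easy to trace on a tiny label image. In the sketch below, every RGB triple is encoded as the single integer (R * 256 + G) * 256 + B, which indexes a table of class indices. The three-color palette, the 2 x 2 label image, the helper name `rgb_to_key`, and the `uint8` table (chosen only to keep memory use small) are assumptions for illustration.

```python
import torch

# Hypothetical three-class palette; the real dataset uses VOC_COLORMAP
TOY_COLORMAP = [[0, 0, 0], [128, 0, 0], [0, 128, 0]]

def rgb_to_key(r, g, b):
    """Encode an RGB triple as a single integer in [0, 256 ** 3)."""
    return (r * 256 + g) * 256 + b

# uint8 keeps this table at 16 MB; this dtype is an assumption of the sketch
lookup = torch.zeros(256 ** 3, dtype=torch.uint8)
for class_idx, (r, g, b) in enumerate(TOY_COLORMAP):
    lookup[rgb_to_key(r, g, b)] = class_idx

# A 3 x 2 x 2 label image in (channel, height, width) layout
label = torch.tensor([[[0, 128], [0, 0]],
                      [[0, 0], [128, 0]],
                      [[0, 0], [0, 0]]])
pixels = label.permute(1, 2, 0).long()   # (height, width, channel)
keys = rgb_to_key(pixels[..., 0], pixels[..., 1], pixels[..., 2])
class_indices = lookup[keys].long()
print(class_indices)
```

The four pixels are black, (128, 0, 0), (0, 128, 0), and black, so the resulting class indices are 0, 1, 2, and 0 in row-major order.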


Data Preprocessing 


In previous experiments such as in Section 8.1—Section 8.4, images are rescaled to fit the 
model’s required input shape. However, in semantic segmentation, doing so requires rescal- 
ing the predicted pixel classes back to the original shape of the input image. Such rescaling 
may be inaccurate, especially for segmented regions with different classes. To avoid this 
issue, we crop the image to a fixed shape instead of rescaling. Specifically, using random 
cropping from image augmentation, we crop the same area of the input image and the la- 
bel. 


#@save
def voc_rand_crop(feature, label, height, width):
    """Randomly crop both feature and label images."""
    rect = torchvision.transforms.RandomCrop.get_params(
        feature, (height, width))
    feature = torchvision.transforms.functional.crop(feature, *rect)
    label = torchvision.transforms.functional.crop(label, *rect)
    return feature, label

imgs = []
for _ in range(n):
    imgs += voc_rand_crop(train_features[0], train_labels[0], 200, 300)

imgs = [img.permute(1, 2, 0) for img in imgs]
d2l.show_images(imgs[::2] + imgs[1::2], 2, n);
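The essential point of the cropping step is that the feature and its label share one crop window, so pixels stay aligned. The following minimal sketch expresses this with plain tensor slicing; the function name `paired_random_crop` and the toy tensors are hypothetical, and it is a simplified stand-in rather than the torchvision routine.

```python
import torch

def paired_random_crop(feature, label, height, width, generator=None):
    """Crop the same random window from a feature and its label image."""
    _, h, w = feature.shape
    # Draw the upper-left corner once and reuse it for both tensors
    top = int(torch.randint(0, h - height + 1, (1,), generator=generator))
    left = int(torch.randint(0, w - width + 1, (1,), generator=generator))
    window = (slice(None), slice(top, top + height),
              slice(left, left + width))
    return feature[window], label[window]

g = torch.Generator().manual_seed(0)
feature = torch.arange(3 * 6 * 8).reshape(3, 6, 8)
label = feature + 1000  # toy label aligned pixel-by-pixel with the feature
f_crop, l_crop = paired_random_crop(feature, label, 4, 5, generator=g)
print(f_crop.shape, l_crop.shape)
print(torch.equal(l_crop, f_crop + 1000))
```

Because the toy label is the feature shifted by a constant, the final check confirms that both crops cover exactly the same pixels.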




Custom Semantic Segmentation Dataset Class 


We define a custom semantic segmentation dataset class VOCSegDataset by inheriting the 
Dataset class provided by high-level APIs. By implementing the __getitem__ function, 
we can arbitrarily access the input image indexed as idx in the dataset and the class index 
of each pixel in this image. Since some images in the dataset have a smaller size than 
the output size of random cropping, these examples are filtered out by a custom filter 
function. In addition, we also define the normalize_image function to standardize the 
values of the three RGB channels of input images. 


#@save
class VOCSegDataset(torch.utils.data.Dataset):
    """A customized dataset to load the VOC dataset."""

    def __init__(self, is_train, crop_size, voc_dir):
        self.transform = torchvision.transforms.Normalize(
            mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        self.crop_size = crop_size
        features, labels = read_voc_images(voc_dir, is_train=is_train)
        self.features = [self.normalize_image(feature)
                         for feature in self.filter(features)]
        self.labels = self.filter(labels)
        self.colormap2label = voc_colormap2label()
        print('read ' + str(len(self.features)) + ' examples')

    def normalize_image(self, img):
        return self.transform(img.float() / 255)

    def filter(self, imgs):
        return [img for img in imgs if (
            img.shape[1] >= self.crop_size[0] and
            img.shape[2] >= self.crop_size[1])]

    def __getitem__(self, idx):
        feature, label = voc_rand_crop(self.features[idx], self.labels[idx],
                                       *self.crop_size)
        return (feature, voc_label_indices(label, self.colormap2label))

    def __len__(self):
        return len(self.features)
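The two preprocessing helpers of the class can be tried in isolation. The sketch below re-implements them as standalone functions on toy tensors: the channel statistics are the ones used in the class, while the function name `large_enough`, the toy image sizes, and the manual standardization in place of the torchvision transform are assumptions.

```python
import torch

mean = torch.tensor([0.485, 0.456, 0.406]).reshape(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).reshape(3, 1, 1)

def normalize_image(img):
    """Scale a uint8 (3, H, W) image to [0, 1], then standardize channels."""
    return (img.float() / 255 - mean) / std

def large_enough(img, crop_size):
    """Keep an image only if it is at least as large as the crop."""
    return img.shape[1] >= crop_size[0] and img.shape[2] >= crop_size[1]

crop_size = (4, 6)
imgs = [torch.full((3, 5, 7), 255, dtype=torch.uint8),   # large enough
        torch.full((3, 3, 7), 255, dtype=torch.uint8)]   # too short
kept = [img for img in imgs if large_enough(img, crop_size)]
print(len(kept))
print(normalize_image(kept[0])[:, 0, 0])
```

Only the first toy image survives the filter, and its all-white pixels map to (1 - mean) / std in each channel.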
Reading the Dataset


We use the custom VOCSegDataset class to create instances of the training set and test set, 
respectively. Suppose that we specify that the output shape of randomly cropped images is 
320 x 480. Below we can view the number of examples that are retained in the training set 
and test set. 


crop_size = (320, 480) 
voc_train = VOCSegDataset(True, crop_size, voc_dir) 
voc_test = VOCSegDataset(False, crop_size, voc_dir) 


read 1114 examples 
read 1078 examples 


Setting the batch size to 64, we define the data iterator for the training set. Let’s print
the shape of the first minibatch. Unlike in image classification or object detection,
labels here are three-dimensional tensors.


batch_size = 64
train_iter = torch.utils.data.DataLoader(voc_train, batch_size, shuffle=True,
                                         drop_last=True,
                                         num_workers=d2l.get_dataloader_workers())
for X, Y in train_iter:
    print(X.shape)
    print(Y.shape)
    break


torch.Size([64, 3, 320, 480]) 
torch.Size([64, 320, 480]) 


Putting It All Together 


Finally, we define the following load_data_voc function to download and read the Pascal 
VOC2012 semantic segmentation dataset. It returns data iterators for both the training and 
test datasets. 


#@save
def load_data_voc(batch_size, crop_size):
    """Load the VOC semantic segmentation dataset."""
    voc_dir = d2l.download_extract('voc2012', os.path.join(
        'VOCdevkit', 'VOC2012'))
    num_workers = d2l.get_dataloader_workers()
    train_iter = torch.utils.data.DataLoader(
        VOCSegDataset(True, crop_size, voc_dir), batch_size,
        shuffle=True, drop_last=True, num_workers=num_workers)
    test_iter = torch.utils.data.DataLoader(
        VOCSegDataset(False, crop_size, voc_dir), batch_size,
        drop_last=True, num_workers=num_workers)
    return train_iter, test_iter


14.9.3 Summary


• Semantic segmentation recognizes and understands what is in an image at the pixel level
by dividing the image into regions belonging to different semantic classes.


• One of the most important semantic segmentation datasets is Pascal VOC2012.


• In semantic segmentation, since the input image and label correspond one-to-one at the
pixel level, the input image is randomly cropped to a fixed shape rather than rescaled.
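The last point can be sketched in a few lines of plain Python. The helper `rand_crop_pair` below is hypothetical (it is not the `voc_rand_crop` function used by the dataset class), and nested lists stand in for image tensors. The only point is that one and the same window is cut from the image and from its label, so every retained pixel keeps its original class.

```python
import random


def rand_crop_pair(feature, label, height, width, rng=random):
    """Cut the same random window out of a feature map and its label map."""
    h, w = len(label), len(label[0])
    top = rng.randint(0, h - height)    # inclusive bounds
    left = rng.randint(0, w - width)

    def crop(img):
        return [row[left:left + width] for row in img[top:top + height]]

    return crop(feature), crop(label)


# Toy 5 x 6 "image" whose pixels remember their own coordinates, and a label
# map that encodes the same coordinates as a single integer.
feature = [[(r, c) for c in range(6)] for r in range(5)]
label = [[r * 10 + c for c in range(6)] for r in range(5)]

f, l = rand_crop_pair(feature, label, 2, 3, random.Random(0))
aligned = all(l[i][j] == f[i][j][0] * 10 + f[i][j][1]
              for i in range(2) for j in range(3))
print(aligned)
```

Rescaling, by contrast, would interpolate between neighboring label values, and an interpolated class index generally does not correspond to the class of any real pixel.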


14.9.4 Exercises 


1. How can semantic segmentation be applied in autonomous vehicles and medical image 
diagnostics? Can you think of other applications? 


2. Recall the descriptions of data augmentation in Section 14.1. Which of the image
augmentation methods used in image classification would be infeasible to apply in
semantic segmentation?




14.10 Transposed Convolution 


The CNN layers we have seen so far, such as convolutional layers (Section 7.2) and pooling
layers (Section 7.5), typically reduce (downsample) the spatial dimensions (height and
width) of the input, or keep them unchanged. In semantic segmentation, which classifies at
the pixel level, it is convenient if the spatial dimensions of the input and output are the
same. For example, the channel dimension at one output pixel can hold the classification
results for the input pixel at the same spatial position.


To achieve this, especially after the spatial dimensions are reduced by CNN layers, we
can use another type of CNN layer that increases (upsamples) the spatial dimensions of
intermediate feature maps. In this section, we will introduce transposed convolution, which
is also called fractionally-strided convolution (Dumoulin and Visin, 2016), for reversing
the downsampling performed by convolution.


import torch 
from torch import nn 
from d2l import torch as d2l


14.10.1 Basic Operation 


Ignoring channels for now, let’s begin with the basic transposed convolution operation with
a stride of 1 and no padding. Suppose that we are given an $n_h \times n_w$ input tensor and
a $k_h \times k_w$ kernel. Sliding the kernel window with a stride of 1 for $n_w$ times in each
row and $n_h$ times in each column yields a total of $n_h n_w$ intermediate results. Each
intermediate result is an $(n_h + k_h - 1) \times (n_w + k_w - 1)$ tensor that is initialized
with zeros. To compute each intermediate tensor, each element in the input tensor is
multiplied by the kernel so that the resulting $k_h \times k_w$ tensor replaces a portion of
each intermediate tensor. Note that the position of the replaced portion in each intermediate
tensor corresponds to the position of the element in the input tensor used for the
computation. In the end, all the intermediate results are summed to produce the output.


As an example, Fig. 14.10.1 illustrates how transposed convolution with a 2 x 2 kernel is 
computed for a 2 x 2 input tensor. 


[Figure: a 2 x 2 input [[0, 1], [2, 3]] and a 2 x 2 kernel [[0, 1], [2, 3]] yield four
intermediate 3 x 3 tensors, whose sum is the output [[0, 0, 1], [0, 4, 6], [4, 12, 9]].]


Transposed convolution with a 2 x 2 kernel. The shaded portions are a portion of an 
intermediate tensor as well as the input and kernel tensor elements used for the 
computation. 


We can implement this basic transposed convolution operation trans_conv for an input
matrix X and a kernel matrix K.


def trans_conv(X, K):
    h, w = K.shape
    Y = torch.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            # Each input element scales the kernel and is added at its position
            Y[i: i + h, j: j + w] += X[i, j] * K
    return Y


In contrast to the regular convolution (in Section 7.2) that reduces input elements via the 
kernel, the transposed convolution broadcasts input elements via the kernel, thereby pro- 
ducing an output that is larger than the input. We can construct the input tensor X and the 
kernel tensor K from Fig. 14.10.1 to validate the output of the above implementation of the 
basic two-dimensional transposed convolution operation. 


X = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
trans_conv(X, K)


tensor([[ 0.,  0.,  1.],
        [ 0.,  4.,  6.],
        [ 4., 12.,  9.]])


Alternatively, when the input X and kernel K are both four-dimensional tensors, we can use
high-level APIs to obtain the same results.


X, K = X.reshape(1, 1, 2, 2), K.reshape(1, 1, 2, 2)
tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, bias=False)
tconv.weight.data = K
tconv(X)


tensor([[[[ 0.,  0.,  1.],
          [ 0.,  4.,  6.],
          [ 4., 12.,  9.]]]], grad_fn=<ConvolutionBackward0>)


14.10.2 Padding, Strides, and Multiple Channels 


Unlike in the regular convolution, where padding is applied to the input, in the transposed
convolution it is applied to the output. For example, when specifying the padding number
on either side of the height and width as 1, the first and last rows and columns will be
removed from the transposed convolution output.


tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, padding=1, bias=False) 
tconv.weight.data = K 
tconv(X) 


tensor([[[[4.]]]], grad_fn=<ConvolutionBackward0>)
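The statement that padding removes rows and columns from the output can be checked directly. The sketch below is illustrative only: it uses the functional form `F.conv_transpose2d` rather than the layer object, and compares the padded result with a manually cropped copy of the unpadded result.

```python
import torch
from torch.nn import functional as F

X = torch.tensor([[0.0, 1.0], [2.0, 3.0]]).reshape(1, 1, 2, 2)
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]]).reshape(1, 1, 2, 2)

full = F.conv_transpose2d(X, K)               # no padding: 3 x 3 output
padded = F.conv_transpose2d(X, K, padding=1)  # padding=1: 1 x 1 output
cropped = full[:, :, 1:-1, 1:-1]              # drop first/last row and column

same = torch.equal(padded, cropped)
print(same)
```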


In the transposed convolution, strides are specified for intermediate results (thus output),
not for input. Using the same input and kernel tensors from Fig. 14.10.1, changing the
stride from 1 to 2 increases both the height and width of intermediate tensors, hence the
output tensor in Fig. 14.10.2.


The following code snippet can validate the transposed convolution output for stride of 2 
in Fig. 14.10.2. 


tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2, bias=False)
tconv.weight.data = K
tconv(X)


tensor([[[[0., 0., 0., 1.],
          [0., 0., 2., 3.],
          [0., 2., 0., 3.],
          [4., 6., 6., 9.]]]], grad_fn=<ConvolutionBackward0>)


For multiple input and output channels, the transposed convolution works in the same way
as the regular convolution. Suppose that the input has $c_i$ channels, and that the transposed
convolution assigns a $k_h \times k_w$ kernel tensor to each input channel. When multiple output
channels are specified, we will have a $c_i \times k_h \times k_w$ kernel for each output channel.


[Figure: with a stride of 2, the 2 x 2 input [[0, 1], [2, 3]] and the 2 x 2 kernel
[[0, 1], [2, 3]] yield four intermediate tensors, whose sum is the 4 x 4 output
[[0, 0, 0, 1], [0, 0, 2, 3], [0, 2, 0, 3], [4, 6, 6, 9]].]


Transposed convolution with a 2 x 2 kernel with stride of 2. The shaded portions are a 
portion of an intermediate tensor as well as the input and kernel tensor elements used for 
the computation. 
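For the multi-channel case described above, it can help to look at how PyTorch lays out the kernel. The sizes in this sketch are arbitrary illustrative choices. The weight of `nn.ConvTranspose2d` is stored with the input channels first, so slicing out one output channel gives a tensor with one kernel per input channel.

```python
import torch
from torch import nn

c_i, c_o, k_h, k_w = 3, 5, 2, 2   # illustrative sizes, not from the text
tconv = nn.ConvTranspose2d(c_i, c_o, kernel_size=(k_h, k_w), bias=False)

# Weight layout: (input channels, output channels, kernel height, kernel width)
weight_shape = tuple(tconv.weight.shape)
# One output channel owns a c_i x k_h x k_w slice of the weight
per_output_shape = tuple(tconv.weight[:, 0].shape)

X = torch.rand(1, c_i, 4, 4)
out_shape = tuple(tconv(X).shape)  # height and width grow from 4 to 4 + 2 - 1
print(weight_shape, per_output_shape, out_shape)
```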


All in all, if we feed X into a convolutional layer f to output Y = f(X) and create a
transposed convolutional layer g with the same hyperparameters as f except for the number of
output channels being the number of channels in X, then g(Y) will have the same shape as
X. This can be illustrated in the following example.


X = torch.rand(size=(1, 10, 16, 16)) 

conv = nn.Conv2d(10, 20, kernel_size=5, padding=2, stride=3)

tconv = nn.ConvTranspose2d(20, 10, kernel_size=5, padding=2, stride=3) 
tconv(conv(X)).shape == X.shape 


True 
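The shape round trip above can also be followed with plain arithmetic. The two helper functions below are written for this illustration (they are not part of the d2l package). They encode the standard output-length formulas along one dimension, assuming no dilation and no output padding.

```python
def conv_out(n, k, p, s):
    """Output length of a regular convolution along one dimension."""
    return (n - k + 2 * p) // s + 1


def tconv_out(n, k, p, s):
    """Output length of a transposed convolution (no output padding)."""
    return (n - 1) * s - 2 * p + k


y = conv_out(16, k=5, p=2, s=3)    # 16 -> 6, as in the example above
x = tconv_out(y, k=5, p=2, s=3)    # 6 -> 16
print(y, x)

# The small examples of this section follow the same formula
print(tconv_out(2, k=2, p=0, s=1), tconv_out(2, k=2, p=1, s=1),
      tconv_out(2, k=2, p=0, s=2))
```

Note that the round trip is exact only when n - k + 2p is divisible by the stride. For an input length of 17 with the same hyperparameters, the convolution still outputs 6 and the transposed convolution returns 16; the `output_padding` argument of `nn.ConvTranspose2d` exists to resolve this ambiguity.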


14.10.3 Connection to Matrix Transposition 


The transposed convolution is named after the matrix transposition. To explain, let’s first 
see how to implement convolutions using matrix multiplications. In the example below, we 
define a 3 x 3 input X and a 2 x 2 convolution kernel K, and then use the corr2d function 
to compute the convolution output Y. 


X = torch.arange(9.0).reshape(3, 3)
K = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
Y = d2l.corr2d(X, K)
Y


tensor([[27., 37.],
        [57., 67.]])


Next, we rewrite the convolution kernel K as a sparse weight matrix W containing a lot of
zeros. The shape of the weight matrix is (4, 9), where the non-zero elements come from
the convolution kernel K.


def kernel2matrix(K):
    k, W = torch.zeros(5), torch.zeros((4, 9))
    # Flatten the 2 x 2 kernel with one zero between its two rows
    k[:2], k[3:5] = K[0, :], K[1, :]
    # Each row of W places the flattened kernel at one output position
    W[0, :5], W[1, 1:6], W[2, 3:8], W[3, 4:] = k, k, k, k
    return W

W = kernel2matrix(K)
W


tensor([[1., 2., 0., 3., 4., 0., 0., 0., 0.],
        [0., 1., 2., 0., 3., 4., 0., 0., 0.],
        [0., 0., 0., 1., 2., 0., 3., 4., 0.],
        [0., 0., 0., 0., 1., 2., 0., 3., 4.]])


Concatenate the input X row by row to get a vector of length 9. Then the matrix multiplica- 
tion of W and the vectorized X gives a vector of length 4. After reshaping it, we can obtain 
the same result Y from the original convolution operation above: we just implemented con- 
volutions using matrix multiplications. 


Y == torch.matmul(W, X.reshape(-1)).reshape(2, 2) 


tensor([[True, True], 
[True, True]]) 


Likewise, we can implement transposed convolutions using matrix multiplications. In the 
following example, we take the 2 x 2 output Y from the above regular convolution as input 
to the transposed convolution. To implement this operation by multiplying matrices, we 
only need to transpose the weight matrix W with the new shape (9, 4). 


Z = trans_conv(Y, K) 
Z == torch.matmul(W.T, Y.reshape(-1)).reshape(3, 3) 


tensor([[True, True, True],
[True, True, True], 
[True, True, True]]) 


Consider implementing the convolution by multiplying matrices. Given an input vector
$\mathbf{x}$ and a weight matrix $\mathbf{W}$, the forward propagation function of the
convolution can be implemented by multiplying its input with the weight matrix and outputting
a vector $\mathbf{y} = \mathbf{W}\mathbf{x}$. Since backpropagation follows the chain rule and
$\nabla_{\mathbf{x}} \mathbf{y} = \mathbf{W}^\top$, the backpropagation function of the
convolution can be implemented by multiplying its input with the transposed weight matrix
$\mathbf{W}^\top$. Therefore, the transposed convolutional layer can just exchange the
forward propagation function and the backpropagation function of the convolutional layer:
its forward propagation and backpropagation functions multiply their input vector with
$\mathbf{W}^\top$ and $\mathbf{W}$, respectively.
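This exchange of forward and backward passes can be observed with automatic differentiation. The following is a minimal sketch, assuming the same 3 x 3 input and 2 x 2 kernel as above and an arbitrary random upstream gradient `G` (a name introduced here for illustration): the gradient that backpropagation delivers to the input of a convolution equals the transposed convolution of `G` with the same kernel.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
X = torch.arange(9.0).reshape(1, 1, 3, 3).requires_grad_()
K = torch.tensor([[1.0, 2.0], [3.0, 4.0]]).reshape(1, 1, 2, 2)

Y = F.conv2d(X, K)        # forward pass of the convolution: y = W x
G = torch.rand_like(Y)    # an arbitrary upstream gradient with Y's shape
Y.backward(G)             # backward pass: X.grad = W^T g

# Forward pass of the transposed convolution applied to the same gradient
matches = torch.allclose(X.grad, F.conv_transpose2d(G, K))
print(matches)
```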


14.10.4 Summary 


• In contrast to the regular convolution that reduces input elements via the kernel, the
transposed convolution broadcasts input elements via the kernel, thereby producing
an output that is larger than the input.


• If we feed X into a convolutional layer f to output Y = f(X) and create a transposed
convolutional layer g with the same hyperparameters as f except for the number of
output channels being the number of channels in X, then g(Y) will have the same shape
as X.


• We can implement convolutions using matrix multiplications. The transposed
convolutional layer can just exchange the forward propagation function and the
backpropagation function of the convolutional layer.


14.10.5 Exercises 


1. In Section 14.10.3, the convolution input X and the transposed convolution output Z have 
the same shape. Do they have the same value? Why? 


2. Is it efficient to use matrix multiplications to implement convolutions? Why? 




14.11 Fully Convolutional Networks 


As discussed in Section 14.9, semantic segmentation classifies images at the pixel level. A
fully convolutional network (FCN) uses a convolutional neural network to transform image
pixels to pixel classes (Long et al., 2015). Unlike the CNNs that we encountered earlier
for image classification or object detection, a fully convolutional network transforms the
height and width of intermediate feature maps back to those of the input image: this is
achieved by the transposed convolutional layer introduced in Section 14.10. As a result, the
classification output and the input image have a one-to-one correspondence at the pixel level:
the channel dimension at any output pixel holds the classification results for the input pixel
at the same spatial position.


%matplotlib inline
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


14.11.1 The Model


Here we describe the basic design of the fully convolutional network model. As shown 
in Fig. 14.11.1, this model first uses a CNN to extract image features, then transforms the 
number of channels into the number of classes via a 1 x 1 convolutional layer, and finally 
transforms the height and width of the feature maps to those of the input image via the 
transposed convolution introduced in Section 14.10. As a result, the model output has the 
same height and width as the input image, where the output channel contains the predicted 
classes for the input pixel at the same spatial position. 


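[Figure: an input image passes through a CNN, a 1 x 1 convolutional layer, and a transposed
convolutional layer, producing a per-pixel prediction with classes such as background and cat.]


Fully convolutional network.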


Below, we use a ResNet-18 model pretrained on the ImageNet dataset to extract image 
features and denote the model instance as pretrained_net. The last few layers of this 
model include a global average pooling layer and a fully connected layer: they are not 
needed in the fully convolutional network. 


pretrained_net = torchvision.models.resnet18(pretrained=True) 
list(pretrained_net.children())[-3:] 


Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /home/ci/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:00<00:00, 56.3MB/s]


[Sequential(
   (0): BasicBlock(
     (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (relu): ReLU(inplace=True)
     (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (downsample): Sequential(
       (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
       (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     )
   )
   (1): BasicBlock(
     (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (relu): ReLU(inplace=True)
     (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   )
 ),
 AdaptiveAvgPool2d(output_size=(1, 1)),
 Linear(in_features=512, out_features=1000, bias=True)]


Next, we create the fully convolutional network instance net. It copies all the pretrained 
layers in the ResNet-18 except for the final global average pooling layer and the fully con- 
nected layer that are closest to the output. 


net = nn.Sequential(*list(pretrained_net.children())[:-2])


Given an input with height and width of 320 and 480 respectively, the forward propagation 
of net reduces the input height and width to 1/32 of the original, namely 10 and 15. 


X = torch.rand(size=(1, 3, 320, 480)) 
net(X).shape


torch.Size([1, 512, 10, 15]) 


Next, we use a $1 \times 1$ convolutional layer to transform the number of output channels into
the number of classes (21) of the Pascal VOC2012 dataset. Finally, we need to increase the
height and width of the feature maps by 32 times to change them back to the height and
width of the input image. Recall how to calculate the output shape of a convolutional layer
in Section 7.3. Since $(320 - 64 + 16 \times 2 + 32)/32 = 10$ and
$(480 - 64 + 16 \times 2 + 32)/32 = 15$, we construct a transposed convolutional layer with
a stride of 32, setting the height and width of the kernel to 64 and the padding to 16. In
general, for stride $s$, padding $s/2$ (assuming $s/2$ is an integer), and kernel height and
width $2s$, the transposed convolution increases the height and width of the input by a
factor of $s$.


num_classes = 21
net.add_module('final_conv', nn.Conv2d(512, num_classes, kernel_size=1))
net.add_module('transpose_conv', nn.ConvTranspose2d(num_classes, num_classes,
                                    kernel_size=64, padding=16, stride=32))
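The general rule stated above (stride s, padding s/2, and kernel size 2s upsample by a factor of s) can be checked with a few lines of arithmetic. The helper `upsampled_size` is a name chosen for this sketch; it applies the transposed convolution output-length formula, assuming an even stride and no output padding.

```python
def upsampled_size(n, s):
    """Transposed convolution output length for kernel 2s, padding s/2, stride s."""
    k, p = 2 * s, s // 2
    return (n - 1) * s - 2 * p + k   # simplifies to n * s for even s


# A feature map of height 10 under several even strides
sizes = {s: upsampled_size(10, s) for s in (2, 4, 8, 16, 32)}
print(sizes)

# The layer constructed above maps 10 x 15 feature maps back to 320 x 480
print(upsampled_size(10, 32), upsampled_size(15, 32))
```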


14.11.2 Initializing Transposed Convolutional Layers 


We already know that transposed convolutional layers can increase the height and width of 
feature maps. In image processing, we may need to scale up an image, i.e., upsampling. 
Bilinear interpolation is one of the commonly used upsampling techniques. It is also often 
used for initializing transposed convolutional layers. 


To explain bilinear interpolation, say that given an input image we want to calculate each 
pixel of the upsampled output image. In order to calculate the pixel of the output image 
at coordinate (x, y), first map (x, y) to coordinate (x’, y’) on the input image, for example, 
according to the ratio of the input size to the output size. Note that the mapped x’ and y’ are 
real numbers. Then, find the four pixels closest to coordinate (x’, y’) on the input image. 
Finally, the pixel of the output image at coordinate (x, y) is calculated based on these four 
closest pixels on the input image and their relative distance from (x’, y’). 
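The procedure just described can be sketched in plain Python. This is an illustration rather than the implementation used by any library: nested lists stand in for a single-channel image, the function names are made up for this sketch, and the coordinate mapping assumes one particular convention in which the corner pixels of input and output coincide.

```python
import math


def bilinear_at(img, y, x):
    """Interpolate img at real-valued (y, x) from its four closest pixels."""
    h, w = len(img), len(img[0])
    y0 = min(max(math.floor(y), 0), h - 1)
    x0 = min(max(math.floor(x), 0), w - 1)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0            # relative distances to the top-left pixel
    top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
    bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def upsample(img, out_h, out_w):
    """Map every output coordinate to the input and interpolate there."""
    in_h, in_w = len(img), len(img[0])
    ry = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    rx = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    return [[bilinear_at(img, i * ry, j * rx) for j in range(out_w)]
            for i in range(out_h)]


up = upsample([[0.0, 1.0], [2.0, 3.0]], 3, 3)
for row in up:
    print(row)
```

Each new pixel in this example lies halfway between its neighbors, so it is simply their average.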


Upsampling of bilinear interpolation can be implemented by the transposed convolutional 
layer with the kernel constructed by the following bilinear_kernel function. Due to 
space limitations, we only provide the implementation of the bilinear_kernel function 
below without discussions on its algorithm design. 


def bilinear_kernel(in_channels, out_channels, kernel_size):
    factor = (kernel_size + 1) // 2
    if kernel_size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = (torch.arange(kernel_size).reshape(-1, 1),
          torch.arange(kernel_size).reshape(1, -1))
    filt = (1 - torch.abs(og[0] - center) / factor) * \
           (1 - torch.abs(og[1] - center) / factor)
    weight = torch.zeros((in_channels, out_channels,
                          kernel_size, kernel_size))
    weight[range(in_channels), range(out_channels), :, :] = filt
    return weight
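To see what this function produces, the same arithmetic can be carried out for a small kernel in plain Python. The helper `bilinear_weights_1d` is introduced only for this sketch. It computes the one-dimensional weights whose outer product forms the two-dimensional filter `filt`.

```python
def bilinear_weights_1d(kernel_size):
    """One-dimensional weights; their outer product is the bilinear filter."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    return [1 - abs(i - center) / factor for i in range(kernel_size)]


w = bilinear_weights_1d(4)                   # kernel size used for 2x upsampling
filt = [[a * b for b in w] for a in w]       # 4 x 4 filter
total = sum(sum(row) for row in filt)
print(w)
print(total)
```

The weights fall off linearly with distance from the kernel center. For the 4 x 4 kernel they sum to 4, which is the number of output pixels produced per input pixel when the stride is 2.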


Let’s experiment with upsampling of bilinear interpolation that is implemented by a
transposed convolutional layer. We construct a transposed convolutional layer that doubles the
height and width, and initialize its kernel with the bilinear_kernel function.


conv_trans = nn.ConvTranspose2d(3, 3, kernel_size=4, padding=1, stride=2,
                                bias=False)
conv_trans.weight.data.copy_(bilinear_kernel(3, 3, 4));


Read the image X and assign the upsampling output to Y. In order to print the image, we
need to adjust the position of the channel dimension.


img = torchvision.transforms.ToTensor()(d2l.Image.open('../img/catdog.jpg'))
X = img.unsqueeze(0)
Y = conv_trans(X)
out_img = Y[0].permute(1, 2, 0).detach()


As we can see, the transposed convolutional layer increases both the height and width of 
the image by a factor of two. Except for the different scales in coordinates, the image 
scaled up by bilinear interpolation and the original image printed in Section 14.3 look the 
same. 


d2l.set_figsize()
print('input image shape:', img.permute(1, 2, 0).shape)
d2l.plt.imshow(img.permute(1, 2, 0));
print('output image shape:', out_img.shape)
d2l.plt.imshow(out_img);


input image shape: torch.Size([561, 728, 3])
output image shape: torch.Size([1122, 1456, 3])




In a fully convolutional network, we initialize the transposed convolutional layer with up- 
sampling of bilinear interpolation. For the 1 x 1 convolutional layer, we use Xavier initial- 
ization. 


W = bilinear_kernel(num_classes, num_classes, 64) 
net.transpose_conv.weight.data.copy_(W);


14.11.3 Reading the Dataset 


We read the semantic segmentation dataset as introduced in Section 14.9. The output image 
shape of random cropping is specified as 320 x 480: both the height and width are divisible 
by 32. 


batch_size, crop_size = 32, (320, 480)
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)


read 1114 examples
read 1078 examples


14.11.4 Training 


Now we can train our constructed fully convolutional network. The loss function and ac- 
curacy calculation here are not essentially different from those in image classification of 
earlier chapters. Because we use the output channel of the transposed convolutional layer 
to predict the class for each pixel, the channel dimension is specified in the loss calculation. 
In addition, the accuracy is calculated based on correctness of the predicted class for all the 
pixels. 


def loss(inputs, targets):
    # Per-pixel cross-entropy, averaged over height and width for each image
    return F.cross_entropy(inputs, targets, reduction='none').mean(1).mean(1)

num_epochs, lr, wd, devices = 5, 0.001, 1e-3, d2l.try_all_gpus()
trainer = torch.optim.SGD(net.parameters(), lr=lr, weight_decay=wd)
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)


loss 0.449, train acc 0.861, test acc 0.852
226.7 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]


[Figure: train loss, train acc, and test acc plotted against epoch (1 to 5).]
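The shapes involved in the loss and accuracy computation described in this section can be traced on random data. The tensor sizes below are arbitrary illustrative choices and the variable names are introduced only for this sketch. The class dimension is the second dimension of the prediction, while the labels carry no class dimension.

```python
import torch
from torch.nn import functional as F

num_classes = 21
logits = torch.randn(2, num_classes, 4, 6)          # (batch, classes, height, width)
labels = torch.randint(0, num_classes, (2, 4, 6))   # (batch, height, width)

# One loss value per pixel, then one averaged value per image
per_pixel = F.cross_entropy(logits, labels, reduction='none')
per_image = per_pixel.mean(1).mean(1)

# Pixel accuracy: fraction of pixels whose highest-scoring class is correct
pixel_acc = (logits.argmax(dim=1) == labels).float().mean()
print(per_pixel.shape, per_image.shape, float(pixel_acc))
```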


14.11.5 Prediction 


When predicting, we need to standardize the input image in each channel and transform the 
image into the four-dimensional input format required by the CNN. 


def predict(img):
    X = test_iter.dataset.normalize_image(img).unsqueeze(0)
    pred = net(X.to(devices[0])).argmax(dim=1)
    return pred.reshape(pred.shape[1], pred.shape[2])


To visualize the predicted class of each pixel, we map the predicted class back to its label
color in the dataset.


def label2image(pred):
    colormap = torch.tensor(d2l.VOC_COLORMAP, device=devices[0])
    X = pred.long()
    return colormap[X, :]


Images in the test dataset vary in size and shape. Since the model uses a transposed con- 
volutional layer with stride of 32, when the height or width of an input image is indivisible 
by 32, the output height or width of the transposed convolutional layer will deviate from 
the shape of the input image. In order to address this issue, we can crop multiple rectangu- 
lar areas with height and width that are integer multiples of 32 in the image, and perform 
forward propagation on the pixels in these areas separately. Note that the union of these 
rectangular areas needs to completely cover the input image. When a pixel is covered by 
multiple rectangular areas, the average of the transposed convolution outputs in separate 
areas for this same pixel can be input to the softmax operation to predict the class. 
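One way to choose such rectangular areas along a single dimension is sketched below. The helper `tile_starts` and its placement strategy (regular steps plus one final tile aligned with the far edge) are assumptions made for this illustration, not a procedure from the original paper. Applying it to the height and to the width separately gives the corners of the crops.

```python
def tile_starts(length, tile):
    """Start offsets of tiles of size `tile` that together cover `length`."""
    if tile % 32 != 0:
        raise ValueError('tile size must be a multiple of 32')
    if length < tile:
        raise ValueError('image is smaller than one tile')
    starts = list(range(0, length - tile + 1, tile))
    if starts[-1] + tile < length:
        starts.append(length - tile)   # last tile is aligned with the edge
    return starts


starts = tile_starts(500, 320)
# Count how many tiles cover each pixel position; overlaps would be averaged
cover = [0] * 500
for s in starts:
    for i in range(s, s + 320):
        cover[i] += 1
print(starts, min(cover), max(cover))
```

Positions with a count of 2 are those where the outputs of two areas would be averaged before the softmax operation.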


For simplicity, we only read a few larger test images, and crop a 320 x 480 area for prediction
starting from the upper-left corner of an image. For these test images, we print their cropped
areas, prediction results, and ground-truth row by row.


voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')
test_images, test_labels = d2l.read_voc_images(voc_dir, False)
n, imgs = 4, []
for i in range(n):
    crop_rect = (0, 0, 320, 480)
    X = torchvision.transforms.functional.crop(test_images[i], *crop_rect)
    pred = label2image(predict(X))
    imgs += [X.permute(1,2,0), pred.cpu(),
             torchvision.transforms.functional.crop(
                 test_labels[i], *crop_rect).permute(1,2,0)]
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2);


14.11.6 Summary


• The fully convolutional network first uses a CNN to extract image features, then
transforms the number of channels into the number of classes via a 1 x 1 convolutional
layer, and finally transforms the height and width of the feature maps to those of the
input image via the transposed convolution.


• In a fully convolutional network, we can use upsampling of bilinear interpolation to
initialize the transposed convolutional layer.


14.11.7 Exercises 


1. If we use Xavier initialization for the transposed convolutional layer in the experiment, 
how does the result change? 


2. Can you further improve the accuracy of the model by tuning the hyperparameters? 
3. Predict the classes of all pixels in test images. 


4. The original fully convolutional network paper also uses outputs of some intermediate 
CNN layers (Long et al., 2015). Try to implement this idea. 




14.12 Neural Style Transfer 


If you are a photography enthusiast, you may be familiar with the filter. It can change 
the color style of photos so that landscape photos become sharper or portrait photos have 
whitened skins. However, one filter usually only changes one aspect of the photo. To apply 
an ideal style to a photo, you probably need to try many different filter combinations. This 
process is as complex as tuning the hyperparameters of a model. 


In this section, we will leverage layerwise representations of a CNN to automatically apply 
the style of one image to another image, i.e., style transfer (Gatys et al., 2016). This task 
needs two input images: one is the content image and the other is the style image. We will 
use neural networks to modify the content image to make it close to the style image in style. 
For example, the content image in Fig. 14.12.1 is a landscape photo taken by us in Mount 
Rainier National Park in the suburbs of Seattle, while the style image is an oil painting with 
the theme of autumn oak trees. In the output synthesized image, the oil brush strokes of 
the style image are applied, leading to more vivid colors, while preserving the main shape 
of the objects in the content image. 


[Figure: a content image and a style image are combined into a synthesized image.]


Given content and style images, style transfer outputs a synthesized image.


14.12.1 Method


Fig. 14.12.2 illustrates the CNN-based style transfer method with a simplified example.
First, we initialize the synthesized image, for example, as the content image. This
synthesized image is the only variable that needs to be updated during the style transfer
process, i.e., the model parameters to be updated during training. Then we choose a pretrained
CNN to extract image features and freeze its model parameters during training. This deep CNN
uses multiple layers to extract hierarchical features for images. We can choose the output
of some of these layers as content features or style features. Take Fig. 14.12.2 as an
example. The pretrained neural network here has 3 convolutional layers, where the second layer
outputs the content features, and the first and third layers output the style features.


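[Figure: the content image, the synthesized image, and the style image each pass through the
same three convolutional layers; the content loss is computed at the second layer, the style
loss at the first and third layers, and the total variation loss on the synthesized image.]


CNN-based style transfer process. Solid lines show the direction of forward propagation
and dotted lines show backward propagation.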


Next, we calculate the loss function of style transfer through forward propagation (direc- 
tion of solid arrows), and update the model parameters (the synthesized image for output) 
through backpropagation (direction of dashed arrows). The loss function commonly used 
in style transfer consists of three parts: (i) content loss makes the synthesized image and 
the content image close in content features; (ii) style loss makes the synthesized image and 
style image close in style features; and (iii) total variation loss helps to reduce the noise 
in the synthesized image. Finally, when the model training is over, we output the model 
parameters of the style transfer to generate the final synthesized image. 


In the following, we will explain the technical details of style transfer via a concrete
experiment.


14.12.2 Reading the Content and Style Images


First, we read the content and style images. From their printed coordinate axes, we can tell 
that these images have different sizes. 


%matplotlib inline
import torch
import torchvision
from torch import nn
from d2l import torch as d2l


d2l.set_figsize()
content_img = d2l.Image.open('../img/rainier.jpg')
d2l.plt.imshow(content_img);


style_img = d2l.Image.open('../img/autumn-oak.jpg')
d2l.plt.imshow(style_img);


14.12.3 Preprocessing and Postprocessing 


Below, we define two functions for preprocessing and postprocessing images. The pre- 
process function standardizes each of the three RGB channels of the input image and 
transforms the results into the CNN input format. The postprocess function restores the 
pixel values in the output image to their original values before standardization. Since the 
image printing function requires that each pixel has a floating point value from 0 to 1, we 
replace any value smaller than 0 or greater than 1 with 0 or 1, respectively.


rgb_mean = torch.tensor([0.485, 0.456, 0.406])
rgb_std = torch.tensor([0.229, 0.224, 0.225])

def preprocess(img, image_shape):
    transforms = torchvision.transforms.Compose([
        torchvision.transforms.Resize(image_shape),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize(mean=rgb_mean, std=rgb_std)])
    return transforms(img).unsqueeze(0)

def postprocess(img):
    img = img[0].to(rgb_std.device)
    # Undo the standardization channel by channel, then clip to [0, 1]
    img = torch.clamp(img.permute(1, 2, 0) * rgb_std + rgb_mean, 0, 1)
    return torchvision.transforms.ToPILImage()(img.permute(2, 0, 1))


14.12.4 Extracting Features 


We use the VGG-19 model pretrained on the ImageNet dataset to extract image features 
(Gatys et al., 2016). 


pretrained_net = torchvision.models.vgg19(pretrained=True) 


Downloading: "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth" to /home/ci/.cache/torch/hub/checkpoints/vgg19-dcbb9e9d.pth
100%|██████████| 548M/548M [00:02<00:00, 213MB/s]


In order to extract the content features and style features of the image, we can select the 
output of certain layers in the VGG network. Generally speaking, layers closer to the input
make it easier to extract details of the image, whereas layers closer to the output make it
easier to extract global information. In order to avoid excessively retaining the details of the
content image in the synthesized image, we choose a VGG layer that is closer to the output 
as the content layer to output the content features of the image. We also select the output 
of different VGG layers for extracting local and global style features. These layers are also 
called style layers. As mentioned in Section 8.2, the VGG network uses 5 convolutional 
blocks. In the experiment, we choose the last convolutional layer of the fourth convolutional 
block as the content layer, and the first convolutional layer of each convolutional block as 
the style layer. The indices of these layers can be obtained by printing the pretrained_net 
instance. 


style_layers, content_layers = [0, 5, 10, 19, 28], [25]


When extracting features using VGG layers, we only need the layers from the input layer up to the content or style layer that is closest to the output layer. Let’s construct a new network instance net that retains only the VGG layers used for feature extraction.


net = nn.Sequential(*[pretrained_net.features[i] for i in
                      range(max(content_layers + style_layers) + 1)])


Given the input X, if we simply invoke the forward propagation net(X), we can only get the output of the last layer. Since we also need the outputs of intermediate layers, we need to perform layer-by-layer computation and keep the content and style layer outputs.


def extract_features(X, content_layers, style_layers):
    contents = []
    styles = []
    for i in range(len(net)):
        X = net[i](X)
        if i in style_layers:
            styles.append(X)
        if i in content_layers:
            contents.append(X)
    return contents, styles


Two functions are defined below: the get_contents function extracts content features 
from the content image, and the get_styles function extracts style features from the style 
image. Since there is no need to update the model parameters of the pretrained VGG during 
training, we can extract the content and the style features even before the training starts. 
Since the synthesized image is a set of model parameters to be updated for style transfer, 
we can only extract the content and style features of the synthesized image by calling the 
extract_features function during training. 


def get_contents(image_shape, device):
    content_X = preprocess(content_img, image_shape).to(device)
    contents_Y, _ = extract_features(content_X, content_layers, style_layers)
    return content_X, contents_Y


def get_styles(image_shape, device):
    style_X = preprocess(style_img, image_shape).to(device)
    _, styles_Y = extract_features(style_X, content_layers, style_layers)
    return style_X, styles_Y


14.12.5 Defining the Loss Function 


Now we will describe the loss function for style transfer. The loss function consists of the 
content loss, style loss, and total variation loss. 


Content Loss 


Similar to the loss function in linear regression, the content loss measures the difference in 
content features between the synthesized image and the content image via the squared loss 
function. The two inputs of the squared loss function are both outputs of the content layer 
computed by the extract_features function. 


def content_loss(Y_hat, Y):
    # We detach the target content from the tree used to dynamically compute
    # the gradient: it is a fixed target value, not a variable. Otherwise the
    # loss will throw an error.
    return torch.square(Y_hat - Y.detach()).mean()


Style Loss 


Style loss, like content loss, uses the squared loss function to measure the difference in style between the synthesized image and the style image. To express the style output of any style layer, we first use the extract_features function to compute the style layer output. Suppose that the output has 1 example, $c$ channels, height $h$, and width $w$. We can transform this output into a matrix $\mathbf{X}$ with $c$ rows and $hw$ columns. This matrix can be thought of as the concatenation of $c$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_c$, each of which has length $hw$. Here, vector $\mathbf{x}_i$ represents the style feature of channel $i$.


In the Gram matrix of these vectors $\mathbf{X}\mathbf{X}^\top \in \mathbb{R}^{c \times c}$, element $x_{ij}$ in row $i$ and column $j$ is the dot product of vectors $\mathbf{x}_i$ and $\mathbf{x}_j$. It represents the correlation of the style features of channels $i$ and $j$. We use this Gram matrix to represent the style output of any style layer. Note that a larger value of $hw$ is likely to produce larger values in the Gram matrix. Note also that the height and width of the Gram matrix are both the number of channels $c$. So that style loss is not affected by these values, the gram function below divides the Gram matrix by the number of elements in $\mathbf{X}$, i.e., $chw$.


def gram(X):
    num_channels, n = X.shape[1], X.numel() // X.shape[1]
    X = X.reshape((num_channels, n))
    return torch.matmul(X, X.T) / (num_channels * n)
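
To make the normalization concrete, here is a small worked example in plain Python (lists instead of tensors, chosen only so the arithmetic can be followed by hand). The helper `gram_reference` is a hypothetical name for this illustration; it forms the dot products of the channel vectors and divides by $chw$, as described above.

```python
def gram_reference(channels):
    """Normalized Gram matrix of a list of c flattened channel vectors."""
    c, hw = len(channels), len(channels[0])
    return [[sum(a * b for a, b in zip(channels[i], channels[j])) / (c * hw)
             for j in range(c)] for i in range(c)]


# Two channels of a 2 x 2 feature map, each flattened to length hw = 4.
x1 = [1.0, 0.0, 1.0, 0.0]
x2 = [0.0, 1.0, 0.0, 1.0]
G = gram_reference([x1, x2])
print(G)  # 2 x 2 matrix: one row and one column per channel
```

Here the two channels never fire at the same position, so the off-diagonal entries are zero, while each diagonal entry is the squared norm 2 divided by $chw = 8$.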


Obviously, the two Gram matrix inputs of the squared loss function for style loss are based 
on the style layer outputs for the synthesized image and the style image. It is assumed here 
that the Gram matrix gram_Y based on the style image has been precomputed. 


def style_loss(Y_hat, gram_Y):
    return torch.square(gram(Y_hat) - gram_Y.detach()).mean()


Total Variation Loss 


Sometimes, the learned synthesized image has a lot of high-frequency noise, i.e., particularly bright or dark pixels. One common noise reduction method is total variation denoising. Denote by $x_{i,j}$ the pixel value at coordinate $(i, j)$. Reducing total variation loss


$$\sum_{i,j} \left|x_{i,j} - x_{i+1,j}\right| + \left|x_{i,j+1} - x_{i,j}\right| \tag{14.12.1}$$


makes values of neighboring pixels on the synthesized image closer.


def tv_loss(Y_hat):
    return 0.5 * (torch.abs(Y_hat[:, :, 1:, :] - Y_hat[:, :, :-1, :]).mean() +
                  torch.abs(Y_hat[:, :, :, 1:] - Y_hat[:, :, :, :-1]).mean())
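
As a sanity check on (14.12.1), the sum can be evaluated by hand on tiny images. The sketch below is a plain-Python illustration, not the tensor implementation: the function name `total_variation` is hypothetical, and it computes the raw sum without any averaging, so its values may differ from a normalized loss by a constant factor.

```python
def total_variation(x):
    """Sum of absolute differences between vertically and horizontally
    adjacent pixels of a 2-D grid given as a list of rows."""
    h, w = len(x), len(x[0])
    vertical = sum(abs(x[i][j] - x[i + 1][j])
                   for i in range(h - 1) for j in range(w))
    horizontal = sum(abs(x[i][j + 1] - x[i][j])
                     for i in range(h) for j in range(w - 1))
    return vertical + horizontal


flat = [[0.5, 0.5], [0.5, 0.5]]    # constant image: no variation
noisy = [[0.0, 1.0], [1.0, 0.0]]   # checkerboard: every neighbor differs
print(total_variation(flat), total_variation(noisy))
```

A constant image has zero total variation, while the checkerboard accumulates one unit for each of its four neighboring pairs, which is why minimizing this quantity suppresses isolated bright or dark pixels.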


Loss Function 


The loss function of style transfer is the weighted sum of content loss, style loss, and total 
variation loss. By adjusting these weight hyperparameters, we can balance among content 
retention, style transfer, and noise reduction on the synthesized image. 


content_weight, style_weight, tv_weight = 1, 1e4, 10 


def compute_loss(X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram):
    # Calculate the content, style, and total variation losses respectively
    contents_l = [content_loss(Y_hat, Y) * content_weight for Y_hat, Y in zip(
        contents_Y_hat, contents_Y)]
    styles_l = [style_loss(Y_hat, Y) * style_weight for Y_hat, Y in zip(
        styles_Y_hat, styles_Y_gram)]
    tv_l = tv_loss(X) * tv_weight
    # Add up all the losses
    l = sum(styles_l + contents_l + [tv_l])
    return contents_l, styles_l, tv_l, l


14.12.6 Initializing the Synthesized Image 


In style transfer, the synthesized image is the only variable that needs to be updated during 
training. Thus, we can define a simple model, SynthesizedImage, and treat the synthe- 
sized image as the model parameters. In this model, forward propagation just returns the 
model parameters. 


class SynthesizedImage(nn.Module):
    def __init__(self, img_shape, **kwargs):
        super(SynthesizedImage, self).__init__(**kwargs)
        self.weight = nn.Parameter(torch.rand(*img_shape))

    def forward(self):
        return self.weight


Next, we define the get_inits function. This function creates a synthesized image model 
instance and initializes it to the image X. Gram matrices for the style image at various style 
layers, styles_Y_gram, are computed prior to training. 


def get_inits(X, device, lr, styles_Y):
    gen_img = SynthesizedImage(X.shape).to(device)
    gen_img.weight.data.copy_(X.data)
    trainer = torch.optim.Adam(gen_img.parameters(), lr=lr)
    styles_Y_gram = [gram(Y) for Y in styles_Y]
    return gen_img(), styles_Y_gram, trainer


14.12.7 Training


When training the model for style transfer, we continuously extract content features and style features of the synthesized image, and calculate the loss function. The training loop is defined below.


def train(X, contents_Y, styles_Y, device, lr, num_epochs, lr_decay_epoch):
    X, styles_Y_gram, trainer = get_inits(X, device, lr, styles_Y)
    scheduler = torch.optim.lr_scheduler.StepLR(trainer, lr_decay_epoch, 0.8)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[10, num_epochs],
                            legend=['content', 'style', 'TV'],
                            ncols=2, figsize=(7, 2.5))
    for epoch in range(num_epochs):
        trainer.zero_grad()
        contents_Y_hat, styles_Y_hat = extract_features(
            X, content_layers, style_layers)
        contents_l, styles_l, tv_l, l = compute_loss(
            X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram)
        l.backward()
        trainer.step()
        scheduler.step()
        if (epoch + 1) % 10 == 0:
            animator.axes[1].imshow(postprocess(X))
            animator.add(epoch + 1, [float(sum(contents_l)),
                                     float(sum(styles_l)), float(tv_l)])
    return X


Now we start to train the model. We rescale the height and width of the content and style images to 300 by 450 pixels. We use the content image to initialize the synthesized image.


device, image_shape = d2l.try_gpu(), (300, 450)  # PIL Image (h, w)
net = net.to(device)
content_X, contents_Y = get_contents(image_shape, device)
_, styles_Y = get_styles(image_shape, device)
output = train(content_X, contents_Y, styles_Y, device, 0.3, 500, 50)


[Figure: the left panel plots the content, style, and TV losses against the epoch; the right panel shows the synthesized image.]


We can see that the synthesized image retains the scenery and objects of the content image, and transfers the color of the style image at the same time. For example, the synthesized image has blocks of color like those in the style image. Some of these blocks even have the subtle texture of brush strokes.


14.12.8 Summary 


• The loss function commonly used in style transfer consists of three parts: (i) content loss makes the synthesized image and the content image close in content features; (ii) style loss makes the synthesized image and style image close in style features; and (iii) total variation loss helps to reduce the noise in the synthesized image.


• We can use a pretrained CNN to extract image features and minimize the loss function to continuously update the synthesized image as model parameters during training.


• We use Gram matrices to represent the style outputs from the style layers.


14.12.9 Exercises 


1. How does the output change when you select different content and style layers? 


2. Adjust the weight hyperparameters in the loss function. Does the output retain more 
content or have less noise? 


3. Use different content and style images. Can you create more interesting synthesized 
images? 


4. Can we apply style transfer for text? Hint: you may refer to the survey paper by Hu et 
al. (2022). 




14.13 Image Classification (CIFAR-10) on Kaggle 


So far, we have been using high-level APIs of deep learning frameworks to directly obtain 
image datasets in tensor format. However, custom image datasets often come in the form 
of image files. In this section, we will start from raw image files, and organize, read, then 
transform them into tensor format step by step. 


We experimented with the CIFAR-10 dataset in Section 14.1, which is an important dataset in computer vision. In this section, we will apply the knowledge we learned in previous sections to practice the Kaggle competition of CIFAR-10 image classification. The web address of the competition is https://www.kaggle.com/c/cifar-10


Fig. 14.13.1 shows the information on the competition’s webpage. In order to submit the 
results, you need to register a Kaggle account. 


Fig. 14.13.1: CIFAR-10 image classification competition webpage information. The competition dataset can be obtained by clicking the “Data” tab.


import collections
import math
import os
import shutil
import pandas as pd
import torch
import torchvision
from torch import nn
from d2l import torch as d2l


14.13.1 Obtaining and Organizing the Dataset 


The competition dataset is divided into a training set and a test set, which contain 50000 
and 300000 images, respectively. In the test set, 10000 images will be used for evaluation, 
while the remaining 290000 images will not be evaluated: they are included just to make 
it hard to cheat with manually labeled results of the test set. The images in this dataset 
are all png color (RGB channels) image files, whose height and width are both 32 pixels. 
The images cover a total of 10 categories, namely airplanes, cars, birds, cats, deer, dogs, 
frogs, horses, boats, and trucks. The upper-left corner of Fig. 14.13.1 shows some images 
of airplanes, cars, and birds in the dataset. 


Downloading the Dataset 


After logging in to Kaggle, we can click the “Data” tab on the CIFAR-10 image classification competition webpage shown in Fig. 14.13.1 and download the dataset by clicking the “Download All” button. After unzipping the downloaded file in ../data, and unzipping train.7z and test.7z inside it, you will find the entire dataset in the following paths:


• ../data/cifar-10/train/[1-50000].png
• ../data/cifar-10/test/[1-300000].png
• ../data/cifar-10/trainLabels.csv
• ../data/cifar-10/sampleSubmission.csv


where the train and test directories contain the training and testing images, respectively, trainLabels.csv provides labels for the training images, and sampleSubmission.csv is a sample submission file.


To make it easier to get started, we provide a small-scale sample of the dataset that contains 
the first 1000 training images and 5 random testing images. To use the full dataset of the 
Kaggle competition, you need to set the following demo variable to False. 


#@save
d2l.DATA_HUB['cifar10_tiny'] = (d2l.DATA_URL + 'kaggle_cifar10_tiny.zip',
                                '2068874e4b9a9f0fb07ebe0ad2b29754449ccacd')

# If you use the full dataset downloaded for the Kaggle competition, set
# `demo` to False
demo = True

if demo:
    data_dir = d2l.download_extract('cifar10_tiny')
else:
    data_dir = '../data/cifar-10/'


Downloading ../data/kaggle_cifar10_tiny.zip from http://d2l-data.s3-accelerate.amazonaws.com/kaggle_cifar10_tiny.zip...


Organizing the Dataset 


We need to organize datasets to facilitate model training and testing. Let’s first read the 
labels from the csv file. The following function returns a dictionary that maps the non- 
extension part of the filename to its label. 


#@save
def read_csv_labels(fname):
    """Read `fname` to return a filename to label dictionary."""
    with open(fname, 'r') as f:
        # Skip the file header line (column name)
        lines = f.readlines()[1:]
    tokens = [l.rstrip().split(',') for l in lines]
    return dict(((name, label) for name, label in tokens))

labels = read_csv_labels(os.path.join(data_dir, 'trainLabels.csv'))
print('# training examples:', len(labels))
print('# classes:', len(set(labels.values())))


# training examples: 1000 
# classes: 10 
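
To illustrate the mapping that the label-reading step produces, here is a self-contained sketch on a made-up three-row file. It swaps in Python's standard csv module instead of splitting lines by hand, and the function name `read_labels` and the file contents are hypothetical; only the two-column layout (file stem, then label) follows the description above.

```python
import csv
import os
import tempfile


def read_labels(fname):
    """Map the first CSV column (file stem) to the second (label)."""
    with open(fname, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return {name: label for name, label in reader}


# Hypothetical label file in the same two-column layout.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'trainLabels.csv')
    with open(path, 'w', newline='') as f:
        f.write('id,label\n1,frog\n2,truck\n3,frog\n')
    demo_labels = read_labels(path)

print(demo_labels)
print('# classes:', len(set(demo_labels.values())))
```

Note that the keys are strings such as '1', matching the non-extension part of filenames like 1.png.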


Next, we define the reorg_train_valid function to split the validation set out of the original training set. The argument valid_ratio in this function is the ratio of the number of examples in the validation set to the number of examples in the original training set. More concretely, let $n$ be the number of images of the class with the fewest examples, and $r$ be the ratio. The validation set will split out $\max(\lfloor nr\rfloor, 1)$ images for each class. Let’s use valid_ratio=0.1 as an example. Since the original training set has 50000 images, there will be 45000 images used for training in the path train_valid_test/train, while the other 5000 images will be split out as the validation set in the path train_valid_test/valid. After organizing the dataset, images of the same class will be placed under the same folder.
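
The split sizes quoted above can be checked with a few lines of arithmetic. This sketch assumes the full training set is balanced, with 10 classes of 5000 images each; the helper name `valid_per_class` is hypothetical and mirrors only the per-class rule stated in the text.

```python
import collections
import math


def valid_per_class(labels, valid_ratio):
    """Number of validation images taken from each class."""
    n = min(collections.Counter(labels).values())  # size of the smallest class
    return max(1, math.floor(n * valid_ratio))


# Assumed balanced full training set: 10 classes with 5000 images each.
full = [c for c in range(10) for _ in range(5000)]
per_class = valid_per_class(full, 0.1)
num_valid = per_class * 10
print(per_class, num_valid, len(full) - num_valid)

# With very small classes the floor would be 0, so the rule falls back to 1.
tiny = ['cat'] * 3 + ['dog'] * 5
print(valid_per_class(tiny, 0.1))
```

The fallback to one image per class matters mainly for small samples, where some classes may have fewer than ten images.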


#@save
def copyfile(filename, target_dir):
    """Copy a file into a target directory."""
    os.makedirs(target_dir, exist_ok=True)
    shutil.copy(filename, target_dir)


#@save
def reorg_train_valid(data_dir, labels, valid_ratio):
    """Split the validation set out of the original training set."""
    # The number of examples of the class that has the fewest examples in the
    # training dataset
    n = collections.Counter(labels.values()).most_common()[-1][1]
    # The number of examples per class for the validation set
    n_valid_per_label = max(1, math.floor(n * valid_ratio))
    label_count = {}
    for train_file in os.listdir(os.path.join(data_dir, 'train')):
        label = labels[train_file.split('.')[0]]
        fname = os.path.join(data_dir, 'train', train_file)
        copyfile(fname, os.path.join(data_dir, 'train_valid_test',
                                     'train_valid', label))
        if label not in label_count or label_count[label] < n_valid_per_label:
            copyfile(fname, os.path.join(data_dir, 'train_valid_test',
                                         'valid', label))
            label_count[label] = label_count.get(label, 0) + 1
        else:
            copyfile(fname, os.path.join(data_dir, 'train_valid_test',
                                         'train', label))
    return n_valid_per_label


The reorg_test function below organizes the testing set for data loading during predic- 
tion. 


#@save
def reorg_test(data_dir):
    """Organize the testing set for data loading during prediction."""
    for test_file in os.listdir(os.path.join(data_dir, 'test')):
        copyfile(os.path.join(data_dir, 'test', test_file),
                 os.path.join(data_dir, 'train_valid_test', 'test',
                              'unknown'))


Finally, we use a function to invoke the read_csv_labels, reorg_train_valid, and reorg_test functions defined above.


def reorg_cifar10_data(data_dir, valid_ratio):
    labels = read_csv_labels(os.path.join(data_dir, 'trainLabels.csv'))
    reorg_train_valid(data_dir, labels, valid_ratio)
    reorg_test(data_dir)


Here we set the batch size to only 32 for the small-scale sample of the dataset. When training and testing on the complete dataset of the Kaggle competition, batch_size should be set to a larger integer, such as 128. We split out 10% of the training examples as the validation set for tuning hyperparameters.


batch_size = 32 if demo else 128
valid_ratio = 0.1
reorg_cifar10_data(data_dir, valid_ratio)


14.13.2 Image Augmentation 


We use image augmentation to address overfitting. For example, images can be flipped horizontally at random during training. We can also perform standardization for the three RGB channels of color images. Some of these operations, which you can tweak, are listed below.


transform_train = torchvision.transforms.Compose([
    # Scale the image up to a square of 40 pixels in both height and width
    torchvision.transforms.Resize(40),
    # Randomly crop a square image of 40 pixels in both height and width to
    # produce a small square of 0.64 to 1 times the area of the original
    # image, and then scale it to a square of 32 pixels in both height and
    # width
    torchvision.transforms.RandomResizedCrop(32, scale=(0.64, 1.0),
                                             ratio=(1.0, 1.0)),
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor(),
    # Standardize each channel of the image
    torchvision.transforms.Normalize([0.4914, 0.4822, 0.4465],
                                     [0.2023, 0.1994, 0.2010])])
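
The crop settings above are easier to read once translated into pixel sizes. The following is a small arithmetic sketch, assuming that the crop area is drawn as a fraction of the 40 by 40 resized image and that an aspect ratio of 1 makes the crop square; the variable names are illustrative only.

```python
import math

resized_side = 40        # side length after scaling the image up
scale = (0.64, 1.0)      # range of the crop area as a fraction of the image

areas = [s * resized_side ** 2 for s in scale]
sides = [math.sqrt(a) for a in areas]  # square crop because the ratio is 1
print(areas, sides)
```

Under these assumptions the random crop has a side between 32 and 40 pixels before it is rescaled to 32 by 32, so the augmentation amounts to a mild random zoom and shift rather than a drastic crop.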


During testing, we only perform standardization on images so as to remove randomness in 
the evaluation results. 


transform_test = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize([0.4914, 0.4822, 0.4465],
                                     [0.2023, 0.1994, 0.2010])])


14.13.3 Reading the Dataset 


Next, we read the organized dataset consisting of raw image files. Each example includes 
an image and a label. 


train_ds, train_valid_ds = [torchvision.datasets.ImageFolder(
    os.path.join(data_dir, 'train_valid_test', folder),
    transform=transform_train) for folder in ['train', 'train_valid']]

valid_ds, test_ds = [torchvision.datasets.ImageFolder(
    os.path.join(data_dir, 'train_valid_test', folder),
    transform=transform_test) for folder in ['valid', 'test']]


During training, we need to specify all the image augmentation operations defined above. 
When the validation set is used for model evaluation during hyperparameter tuning, no 
randomness from image augmentation should be introduced. Before final prediction, we 
train the model on the combined training set and validation set to make full use of all the 
labeled data. 


train_iter, train_valid_iter = [torch.utils.data.DataLoader( 
dataset, batch_size, shuffle=True, drop_last=True) 
for dataset in (train_ds, train_valid_ds)] 


valid_iter = torch.utils.data.DataLoader(valid_ds, batch_size, shuffle=False, 
drop_last=True) 


test_iter = torch.utils.data.DataLoader(test_ds, batch_size, shuffle=False, 
drop_last=False) 


14.13.4 Defining the Model 
We define the ResNet-18 model described in Section 8.6. 


def get_net():
    num_classes = 10
    net = d2l.resnet18(num_classes, 3)
    return net


loss = nn.CrossEntropyLoss(reduction="none") 


14.13.5 Defining the Training Function 


We will select models and tune hyperparameters according to the model’s performance on 
the validation set. In the following, we define the model training function train. 


def train(net, train_iter, valid_iter, num_epochs, lr, wd, devices, lr_period,
          lr_decay):
    trainer = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9,
                              weight_decay=wd)
    scheduler = torch.optim.lr_scheduler.StepLR(trainer, lr_period, lr_decay)
    num_batches, timer = len(train_iter), d2l.Timer()
    legend = ['train loss', 'train acc']
    if valid_iter is not None:
        legend.append('valid acc')
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=legend)
    net = nn.DataParallel(net, device_ids=devices).to(devices[0])
    for epoch in range(num_epochs):
        net.train()
        metric = d2l.Accumulator(3)
        for i, (features, labels) in enumerate(train_iter):
            timer.start()
            l, acc = d2l.train_batch_ch13(net, features, labels,
                                          loss, trainer, devices)
            metric.add(l, acc, labels.shape[0])
            timer.stop()
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (metric[0] / metric[2], metric[1] / metric[2],
                              None))
        if valid_iter is not None:
            valid_acc = d2l.evaluate_accuracy_gpu(net, valid_iter)
            animator.add(epoch + 1, (None, None, valid_acc))
        scheduler.step()
    measures = (f'train loss {metric[0] / metric[2]:.3f}, '
                f'train acc {metric[1] / metric[2]:.3f}')
    if valid_iter is not None:
        measures += f', valid acc {valid_acc:.3f}'
    print(measures + f'\n{metric[2] * num_epochs / timer.sum():.1f}'
          f' examples/sec on {str(devices)}')


14.13.6 Training and Validating the Model 


Now, we can train and validate the model. All the following hyperparameters can be tuned. For example, we can increase the number of epochs. When lr_period and lr_decay are set to 4 and 0.9, respectively, the learning rate of the optimization algorithm will be multiplied by 0.9 after every 4 epochs. Just for ease of demonstration, we only train 20 epochs here.
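
The effect of this step decay can be tabulated without running any training. The sketch below is plain arithmetic under the assumption that the scheduler is stepped once at the end of every epoch; `step_lr` is a hypothetical helper, and the base learning rate of 2e-4 is the value used in this section.

```python
def step_lr(base_lr, epoch, period, decay):
    """Learning rate in effect during `epoch` (0-based) under step decay."""
    return base_lr * decay ** (epoch // period)


lr, lr_period, lr_decay, num_epochs = 2e-4, 4, 0.9, 20
schedule = [step_lr(lr, e, lr_period, lr_decay) for e in range(num_epochs)]
for e in range(0, num_epochs, lr_period):
    print(f'epochs {e + 1}-{e + lr_period}: lr = {schedule[e]:.3e}')
```

Over 20 epochs the rate is reduced four times, so the last four epochs run at $0.9^4 \approx 0.66$ of the initial value. A larger decay period or a smaller decay factor changes this profile accordingly.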


devices, num_epochs, lr, wd = d2l.try_all_gpus(), 20, 2e-4, 5e-4
lr_period, lr_decay, net = 4, 0.9, get_net()
net(next(iter(train_iter))[0])
train(net, train_iter, valid_iter, num_epochs, lr, wd, devices, lr_period,
      lr_decay)


train loss 0.654, train acc 0.789, valid acc 0.438
958.1 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]


14.13.7 Classifying the Testing Set and Submitting Results on Kaggle 


After obtaining a promising model with hyperparameters, we use all the labeled data (in- 
cluding the validation set) to retrain the model and classify the testing set. 


net, preds = get_net(), []
net(next(iter(train_valid_iter))[0])
train(net, train_valid_iter, None, num_epochs, lr, wd, devices, lr_period,
      lr_decay)

for X, _ in test_iter:
    y_hat = net(X.to(devices[0]))
    preds.extend(y_hat.argmax(dim=1).type(torch.int32).cpu().numpy())
sorted_ids = list(range(1, len(test_ds) + 1))
sorted_ids.sort(key=lambda x: str(x))
df = pd.DataFrame({'id': sorted_ids, 'label': preds})
df['label'] = df['label'].apply(lambda x: train_valid_ds.classes[x])
df.to_csv('submission.csv', index=False)


train loss 0.608, train acc 0.786
1040.8 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]


[Figure: training loss and training accuracy plotted against the epoch.]


The above code will generate a submission.csv file, whose format meets the requirement of the Kaggle competition. The method for submitting results to Kaggle is similar to that in Section 5.7.
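
One detail of the submission code deserves a short illustration: the ids are sorted as strings, not as numbers. A plausible reason, stated here as an assumption, is that the image folder is read in lexicographic filename order, so 10.png comes before 2.png, and the ids must follow the same order to line up with the predictions. The tiny example below uses a hypothetical test set of 12 files.

```python
num_test = 12  # hypothetical tiny test set with files 1.png ... 12.png

# Order in which files are assumed to be read: sorted by filename.
filenames = sorted(f'{i}.png' for i in range(1, num_test + 1))

# The same ordering reproduced on integer ids by sorting on their string form.
sorted_ids = sorted(range(1, num_test + 1), key=str)

print(filenames[:5])
print(sorted_ids[:5])
```

Sorting the ids numerically instead would silently pair each prediction with the wrong image in the submission file.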


14.13.8 Summary 


• We can read datasets containing raw image files after organizing them into the required format.


• We can use convolutional neural networks and image augmentation in an image classification competition.


14.13.9 Exercises 


1. Use the complete CIFAR-10 dataset for this Kaggle competition. Set hyperparameters as batch_size = 128, num_epochs = 100, lr = 0.1, lr_period = 50, and lr_decay = 0.1. See what accuracy and ranking you can achieve in this competition. Can you further improve them?


2. What accuracy can you get when not using image augmentation? 




14.14 Dog Breed Identification (ImageNet Dogs) on 
Kaggle 


In this section, we will practice the dog breed identification problem on Kaggle. The web address of this competition is https://www.kaggle.com/c/dog-breed-identification


In this competition, 120 different breeds of dogs will be recognized. In fact, the dataset for this competition is a subset of the ImageNet dataset. Unlike the images in the CIFAR-10 dataset in Section 14.13, the images in the ImageNet dataset are both taller and wider, and their dimensions vary. Fig. 14.14.1 shows the information on the competition’s webpage. You need a Kaggle account to submit your results.


import os
import torch
import torchvision
from torch import nn
from d2l import torch as d2l


14.14.1 Obtaining and Organizing the Dataset 


The competition dataset is divided into a training set and a test set, which contain 10222 
and 10357 JPEG images of three RGB (color) channels, respectively. Among the training 
dataset, there are 120 breeds of dogs such as Labradors, Poodles, Dachshunds, Samoyeds, 
Huskies, Chihuahuas, and Yorkshire Terriers. 


Downloading the Dataset 


After logging into Kaggle, you can click on the “Data” tab on the competition webpage 
shown in Fig. 14.14.1 and download the dataset by clicking the “Download All” button. 
After unzipping the downloaded file in ../data, you will find the entire dataset in the 
following paths: 


• ../data/dog-breed-identification/labels.csv
• ../data/dog-breed-identification/sample_submission.csv
• ../data/dog-breed-identification/train
• ../data/dog-breed-identification/test


Fig. 14.14.1: The dog breed identification competition website. The competition dataset can be obtained by clicking the “Data” tab.


You may have noticed that the above structure is similar to that of the CIFAR-10 competition 
in Section 14.13, where folders train/ and test/ contain training and testing dog images, 
respectively, and labels.csv contains the labels for the training images. Similarly, to 
make it easier to get started, we provide a small sample of the dataset mentioned above: 
train_valid_test_tiny.zip. If you are going to use the full dataset for the Kaggle 
competition, you need to change the demo variable below to False. 


#@save
d2l.DATA_HUB['dog_tiny'] = (d2l.DATA_URL + 'kaggle_dog_tiny.zip',
                            '0cb91d09b814ecdc07b50f31f8dcad3e81d6a86d')

# If you use the full dataset downloaded for the Kaggle competition, change
# the variable below to `False`
demo = True
if demo:
    data_dir = d2l.download_extract('dog_tiny')
else:
    data_dir = os.path.join('..', 'data', 'dog-breed-identification')


Downloading ../data/kaggle_dog_tiny.zip from http://d2l-data.s3-accelerate.amazonaws.com/kaggle_dog_tiny.zip...


Organizing the Dataset 


We can organize the dataset similarly to what we did in Section 14.13, namely splitting out 
a validation set from the original training set, and moving images into subfolders grouped 
by labels. 


The reorg_dog_data function below reads the training data labels, splits out the validation 
set, and organizes the training set. 


def reorg_dog_data(data_dir, valid_ratio):
    labels = d2l.read_csv_labels(os.path.join(data_dir, 'labels.csv'))
    d2l.reorg_train_valid(data_dir, labels, valid_ratio)
    d2l.reorg_test(data_dir)


batch_size = 32 if demo else 128 
valid_ratio = 0.1
reorg_dog_data(data_dir, valid_ratio) 


14.14.2 Image Augmentation 


Recall that this dog breed dataset is a subset of the ImageNet dataset, whose images are 
larger than those of the CIFAR-10 dataset in Section 14.13. The following lists a few image 
augmentation operations that might be useful for relatively larger images. 


transform_train = torchvision.transforms.Compose([
    # Randomly crop the image to obtain an image with an area of 0.08 to 1 of
    # the original area and height-to-width ratio between 3/4 and 4/3. Then,
    # scale the image to create a new 224 x 224 image
    torchvision.transforms.RandomResizedCrop(224, scale=(0.08, 1.0),
                                             ratio=(3.0/4.0, 4.0/3.0)),
    torchvision.transforms.RandomHorizontalFlip(),
    # Randomly change the brightness, contrast, and saturation
    torchvision.transforms.ColorJitter(brightness=0.4,
                                       contrast=0.4,
                                       saturation=0.4),
    # Convert the image to a tensor
    torchvision.transforms.ToTensor(),
    # Standardize each channel of the image
    torchvision.transforms.Normalize([0.485, 0.456, 0.406],
                                     [0.229, 0.224, 0.225])])


During prediction, we only use image preprocessing operations without randomness. 


transform_test = torchvision.transforms.Compose([
    torchvision.transforms.Resize(256),
    # Crop a 224 x 224 square area from the center of the image
    torchvision.transforms.CenterCrop(224),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize([0.485, 0.456, 0.406],
                                     [0.229, 0.224, 0.225])])


14.14.3 Reading the Dataset 


As in Section 14.13, we can read the organized dataset consisting of raw image files. 


train_ds, train_valid_ds = [torchvision.datasets.ImageFolder(
    os.path.join(data_dir, 'train_valid_test', folder),
    transform=transform_train) for folder in ['train', 'train_valid']]

valid_ds, test_ds = [torchvision.datasets.ImageFolder(
    os.path.join(data_dir, 'train_valid_test', folder),
    transform=transform_test) for folder in ['valid', 'test']]


Below we create data iterator instances the same way as in Section 14.13. 


train_iter, train_valid_iter = [torch.utils.data.DataLoader( 
dataset, batch_size, shuffle=True, drop_last=True) 
for dataset in (train_ds, train_valid_ds)] 


valid_iter = torch.utils.data.DataLoader(valid_ds, batch_size, shuffle=False, 
drop_last=True) 


test_iter = torch.utils.data.DataLoader(test_ds, batch_size, shuffle=False, 
drop_last=False) 


14.14.4 Fine-Tuning a Pretrained Model 


Again, the dataset for this competition is a subset of the ImageNet dataset. Therefore, we 
can use the approach discussed in Section 14.2 to select a model pretrained on the full 
ImageNet dataset and use it to extract image features to be fed into a custom small-scale 
output network. High-level APIs of deep learning frameworks provide a wide range of 
models pretrained on the ImageNet dataset. Here, we choose a pretrained ResNet-34 model, 
where we simply reuse the input of this model’s output layer (i.e., the extracted features). 
Then we can replace the original output layer with a small custom output network that can 
be trained, such as stacking two fully connected layers. Different from the experiment in 
Section 14.2, the following does not retrain the pretrained model used for feature extraction. 
This reduces training time and memory for storing gradients. 


Recall that we standardized images using the means and standard deviations of the three 
RGB channels for the full ImageNet dataset. In fact, this is also consistent with the stan- 
dardization operation by the pretrained model on ImageNet. 


def get_net(devices):
    finetune_net = nn.Sequential()
    finetune_net.features = torchvision.models.resnet34(pretrained=True)
    # Define a new output network (there are 120 output categories)
    finetune_net.output_new = nn.Sequential(nn.Linear(1000, 256),
                                            nn.ReLU(),
                                            nn.Linear(256, 120))
    # Move the model to devices
    finetune_net = finetune_net.to(devices[0])
    # Freeze parameters of feature layers
    for param in finetune_net.features.parameters():
        param.requires_grad = False
    return finetune_net


Before calculating the loss, we first obtain the input of the pretrained model’s output layer, 
i.e., the extracted feature. Then we use this feature as input for our small custom output 
network to calculate the loss. 


loss = nn.CrossEntropyLoss(reduction='none')

def evaluate_loss(data_iter, net, devices):
    l_sum, n = 0.0, 0
    for features, labels in data_iter:
        features, labels = features.to(devices[0]), labels.to(devices[0])
        outputs = net(features)
        l = loss(outputs, labels)
        l_sum += l.sum()
        n += labels.numel()
    return l_sum / n


14.14.5 Defining the Training Function 


We will select the model and tune hyperparameters according to the model’s performance 
on the validation set. The model training function train only iterates parameters of the 
small custom output network. 


def train(net, train_iter, valid_iter, num_epochs, lr, wd, devices, lr_period,
          lr_decay):
    # Only train the small custom output network
    net = nn.DataParallel(net, device_ids=devices).to(devices[0])
    trainer = torch.optim.SGD((param for param in net.parameters()
                               if param.requires_grad), lr=lr,
                              momentum=0.9, weight_decay=wd)
    scheduler = torch.optim.lr_scheduler.StepLR(trainer, lr_period, lr_decay)
    num_batches, timer = len(train_iter), d2l.Timer()
    legend = ['train loss']
    if valid_iter is not None:
        legend.append('valid loss')
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=legend)
    for epoch in range(num_epochs):
        metric = d2l.Accumulator(2)
        for i, (features, labels) in enumerate(train_iter):
            timer.start()
            features, labels = features.to(devices[0]), labels.to(devices[0])
            trainer.zero_grad()
            output = net(features)
            l = loss(output, labels).sum()
            l.backward()
            trainer.step()
            metric.add(l, labels.shape[0])
            timer.stop()
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (metric[0] / metric[1], None))
        measures = f'train loss {metric[0] / metric[1]:.3f}'
        if valid_iter is not None:
            valid_loss = evaluate_loss(valid_iter, net, devices)
            animator.add(epoch + 1, (None, valid_loss.detach().cpu()))
        scheduler.step()
    if valid_iter is not None:
        measures += f', valid loss {valid_loss:.3f}'
    print(measures + f'\n{metric[1] * num_epochs / timer.sum():.1f}'
          f' examples/sec on {str(devices)}')


14.14.6 Training and Validating the Model 


Now we can train and validate the model. The following hyperparameters are all tunable.
For example, the number of epochs can be increased. Because lr_period and lr_decay
are set to 2 and 0.9, respectively, the learning rate of the optimization algorithm will be
multiplied by 0.9 after every 2 epochs.


devices, num_epochs, lr, wd = d2l.try_all_gpus(), 10, 1e-4, 1e-4
lr_period, lr_decay, net = 2, 0.9, get_net(devices)
train(net, train_iter, valid_iter, num_epochs, lr, wd, devices, lr_period,
      lr_decay)


train loss 1.240, valid loss 1.545
577.5 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]


[Figure: training loss and validation loss plotted against the epoch (1 to 10).]
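

The step decay described above can be checked with a few lines of plain Python. This is only a sketch of the schedule, not the PyTorch scheduler itself: `step_lr` is a hypothetical helper introduced here, and it assumes the scheduler is stepped once at the end of every epoch, as in the train function.

```python
def step_lr(base_lr, epoch, lr_period, lr_decay):
    """Learning rate in effect during `epoch` (0-based) under step decay."""
    # The rate is multiplied by `lr_decay` once every `lr_period` epochs
    return base_lr * lr_decay ** (epoch // lr_period)

# Hyperparameters taken from the text: lr = 1e-4, lr_period = 2, lr_decay = 0.9
schedule = [step_lr(1e-4, epoch, lr_period=2, lr_decay=0.9)
            for epoch in range(10)]
for epoch, rate in enumerate(schedule):
    print(f'epoch {epoch + 1}: lr = {rate:.3e}')
```

With these settings the first two epochs use the base learning rate, and the last two epochs use the base rate multiplied by 0.9 four times.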


14.14.7 Classifying the Testing Set and Submitting Results on Kaggle


Similar to the final step in Section 14.13, in the end all the labeled data (including the 
validation set) are used for training the model and classifying the testing set. We will use 
the trained custom output network for classification. 


net = get_net(devices)
train(net, train_valid_iter, None, num_epochs, lr, wd, devices, lr_period,
      lr_decay)

preds = []
for data, label in test_iter:
    output = torch.nn.functional.softmax(net(data.to(devices[0])), dim=1)
    preds.extend(output.cpu().detach().numpy())
ids = sorted(os.listdir(
    os.path.join(data_dir, 'train_valid_test', 'test', 'unknown')))
with open('submission.csv', 'w') as f:
    f.write('id,' + ','.join(train_valid_ds.classes) + '\n')
    for i, output in zip(ids, preds):
        f.write(i.split('.')[0] + ',' + ','.join(
            [str(num) for num in output]) + '\n')


train loss 1.217
742.7 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]


[Figure: training loss plotted against the epoch (1 to 10).]


The above code will generate a submission.csv file to be submitted to Kaggle in the same
way described in Section 5.7.


14.14.8 Summary 


• Images in the ImageNet dataset are larger (with varying dimensions) than CIFAR-10
  images. We may modify image augmentation operations for tasks on a different dataset.


• To classify a subset of the ImageNet dataset, we can leverage pretrained models on the
  full ImageNet dataset to extract features and only train a custom small-scale output
  network. This will lead to less computational time and memory cost.


14.14.9 Exercises 


1. When using the full Kaggle competition dataset, what results can you achieve when you
increase batch_size (batch size) and num_epochs (number of epochs) while setting
some other hyperparameters as lr = 0.01, lr_period = 10, and lr_decay = 0.1?


2. Do you get better results if you use a deeper pretrained model? How do you tune hyper- 
parameters? Can you further improve the results? 


Discussions.


15 Natural Language Processing: Pretraining


Humans need to communicate. Out of this basic need of the human condition, a vast amount
of written text has been generated on an everyday basis. Given rich text in social media, 
chat apps, emails, product reviews, news articles, research papers, and books, it becomes 
vital to enable computers to understand them to offer assistance or make decisions based 
on human languages. 


Natural language processing studies interactions between computers and humans using 
natural languages. In practice, it is very common to use natural language processing tech- 
niques to process and analyze text (human natural language) data, such as language models 
in Section 9.3 and machine translation models in Section 10.5. 


To understand text, we can begin by learning its representations. Leveraging the existing 
text sequences from large corpora, self-supervised learning has been extensively used to 
pretrain text representations, such as by predicting some hidden part of the text using some 
other part of their surrounding text. In this way, models learn through supervision from 
massive text data without expensive labeling efforts! 


As we will see in this chapter, when treating each word or subword as an individual token, 
the representation of each token can be pretrained using word2vec, GloVe, or subword 
embedding models on large corpora. After pretraining, the representation of each token is
a vector; however, it remains the same no matter what the context is. For instance,
the vector representation of “bank” is the same in both “go to the bank to deposit some 
money” and “go to the bank to sit down”. Thus, many more recent pretraining models 
adapt representation of the same token to different contexts. Among them is BERT, a much 
deeper self-supervised model based on the Transformer encoder. In this chapter, we will 
focus on how to pretrain such representations for text, as highlighted in Fig. 15.1. 


To see the big picture, Fig. 15.1 shows that the pretrained text representations can be
fed to a variety of deep learning architectures for different downstream natural language 
processing applications. We will cover them in Chapter 16. 


Fig. 15.1: Pretrained text representations can be fed to various deep learning architectures for
different downstream natural language processing applications. This chapter focuses on
the upstream text representation pretraining.


15.1 Word Embedding (word2vec) 


Natural language is a complex system used to express meanings. In this system, words
are the basic units of meaning. As the name implies, word vectors are vectors used to
represent words, and can also be considered as feature vectors or representations of words.
The technique of mapping words to real vectors is called word embedding. In recent years,
word embedding has gradually become basic knowledge in natural language processing.


15.1.1 One-Hot Vectors Are a Bad Choice 


We used one-hot vectors to represent words (characters are words) in Section 9.5. Suppose
that the number of different words in the dictionary (the dictionary size) is $N$, and each
word corresponds to a different integer (index) from $0$ to $N-1$. To obtain the one-hot
vector representation for any word with index $i$, we create a length-$N$ vector with all 0s and
set the element at position $i$ to 1. In this way, each word is represented as a vector of length
$N$, and it can be used directly by neural networks.


Although one-hot word vectors are easy to construct, they are usually not a good choice. A
main reason is that one-hot word vectors cannot accurately express the similarity between
different words, such as the cosine similarity that we often use. For vectors
$\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, their cosine similarity is the cosine of the angle between them:

$$\frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|} \in [-1, 1]. \tag{15.1.1}$$

Since the cosine similarity between one-hot vectors of any two different words is 0, one-hot
vectors cannot encode similarities among words.
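

As a quick check of this claim, the following sketch builds one-hot vectors for a toy dictionary and evaluates (15.1.1) for every pair. The dictionary size of 5 and the helper names `one_hot` and `cosine_similarity` are assumptions made for illustration only.

```python
import math

def one_hot(index, size):
    """Length-`size` vector of 0s with a single 1 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def cosine_similarity(x, y):
    """Cosine of the angle between x and y, as in (15.1.1)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

N = 5  # toy dictionary size (hypothetical)
sims = [[cosine_similarity(one_hot(i, N), one_hot(j, N)) for j in range(N)]
        for i in range(N)]
for row in sims:
    print(row)
```

Every off-diagonal entry is 0, so under this representation no pair of distinct words is any more similar than any other pair.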




15.1.2 Self-Supervised word2vec 


The word2vec tool was proposed to address the above issue. It maps each word to a
fixed-length vector, and these vectors can better express the similarity and analogy
relationship among different words. The word2vec tool contains two models, namely skip-gram
(Mikolov et al., 2013) and continuous bag of words (CBOW) (Mikolov et al., 2013). For 
semantically meaningful representations, their training relies on conditional probabilities 
that can be viewed as predicting some words using some of their surrounding words in cor- 
pora. Since supervision comes from the data without labels, both skip-gram and continuous 
bag of words are self-supervised models. 


In the following, we will introduce these two models and their training methods. 


15.1.3 The Skip-Gram Model 


The skip-gram model assumes that a word can be used to generate its surrounding words in
a text sequence. Take the text sequence “the”, “man”, “loves”, “his”, “son” as an example.
Let’s choose “loves” as the center word and set the context window size to 2. As shown in
Fig. 15.1.1, given the center word “loves”, the skip-gram model considers the conditional
probability for generating the context words: “the”, “man”, “his”, and “son”, which are no
more than 2 words away from the center word:

$$P(\textrm{"the"}, \textrm{"man"}, \textrm{"his"}, \textrm{"son"} \mid \textrm{"loves"}). \tag{15.1.2}$$

Assume that the context words are independently generated given the center word (i.e.,
conditional independence). In this case, the above conditional probability can be rewritten
as

$$P(\textrm{"the"} \mid \textrm{"loves"}) \cdot P(\textrm{"man"} \mid \textrm{"loves"}) \cdot P(\textrm{"his"} \mid \textrm{"loves"}) \cdot P(\textrm{"son"} \mid \textrm{"loves"}). \tag{15.1.3}$$


Fig. 15.1.1: The skip-gram model considers the conditional probability of generating the
surrounding context words given a center word.


In the skip-gram model, each word has two $d$-dimensional vector representations for
calculating conditional probabilities. More concretely, for any word with index $i$ in the
dictionary, denote by $\mathbf{v}_i \in \mathbb{R}^d$ and $\mathbf{u}_i \in \mathbb{R}^d$ its two vectors when used as a center word and a
context word, respectively. The conditional probability of generating any context word $w_o$
(with index $o$ in the dictionary) given the center word $w_c$ (with index $c$ in the dictionary)
can be modeled by a softmax operation on vector dot products:

$$P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_o^\top \mathbf{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}, \tag{15.1.4}$$

where the vocabulary index set $\mathcal{V} = \{0, 1, \ldots, |\mathcal{V}|-1\}$. Consider a text sequence of length
$T$, where the word at time step $t$ is denoted as $w^{(t)}$. Assume that context words are
independently generated given any center word. For context window size $m$, the likelihood
function of the skip-gram model is the probability of generating all context words given
any center word:

$$\prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(w^{(t+j)} \mid w^{(t)}), \tag{15.1.5}$$

where any time step that is less than 1 or greater than $T$ can be omitted.
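

To make the softmax in (15.1.4) and the likelihood in (15.1.5) concrete, here is a minimal sketch in plain Python. The vocabulary size, the vector dimension, the random vectors, and the toy index sequence are all assumptions for illustration; the names `center_vecs`, `context_vecs`, `skip_gram_probs`, and `skip_gram_log_likelihood` are hypothetical and not part of any library.

```python
import math
import random

random.seed(0)
vocab_size, d = 6, 4  # toy sizes (hypothetical)
# center_vecs[i] plays the role of v_i, context_vecs[i] the role of u_i
center_vecs = [[random.gauss(0, 1) for _ in range(d)]
               for _ in range(vocab_size)]
context_vecs = [[random.gauss(0, 1) for _ in range(d)]
                for _ in range(vocab_size)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def skip_gram_probs(c):
    """Return [P(w_i | w_c) for every index i], following (15.1.4)."""
    scores = [dot(u, center_vecs[c]) for u in context_vecs]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def skip_gram_log_likelihood(seq, m):
    """Logarithm of the likelihood (15.1.5) for a sequence of word indices."""
    log_likelihood = 0.0
    for t, c in enumerate(seq):
        probs = skip_gram_probs(c)
        for j in range(-m, m + 1):
            # Skip the center word itself and positions outside the sequence
            if j != 0 and 0 <= t + j < len(seq):
                log_likelihood += math.log(probs[seq[t + j]])
    return log_likelihood

seq = [0, 1, 2, 3, 4]  # stands in for "the man loves his son"
log_likelihood = skip_gram_log_likelihood(seq, m=2)
print(sum(skip_gram_probs(2)), log_likelihood)
```

Note that each call to `skip_gram_probs` sums over the whole vocabulary, which is cheap here but becomes the dominant cost for large dictionaries.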


Training 


The skip-gram model parameters are the center word vector and context word vector for
each word in the vocabulary. In training, we learn the model parameters by maximizing the
likelihood function (i.e., maximum likelihood estimation). This is equivalent to minimizing
the following loss function:

$$-\sum_{t=1}^{T} \sum_{-m \leq j \leq m,\ j \neq 0} \log P(w^{(t+j)} \mid w^{(t)}). \tag{15.1.6}$$


When using stochastic gradient descent to minimize the loss, in each iteration we can
randomly sample a shorter subsequence to calculate the (stochastic) gradient for this
subsequence to update the model parameters. To calculate this (stochastic) gradient, we need
to obtain the gradients of the log conditional probability with respect to the center word
vector and the context word vector. In general, according to (15.1.4) the log conditional
probability involving any pair of the center word $w_c$ and the context word $w_o$ is

$$\log P(w_o \mid w_c) = \mathbf{u}_o^\top \mathbf{v}_c - \log\left(\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)\right). \tag{15.1.7}$$


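Through differentiation, we can obtain its gradient with respect to the center word vector
$\mathbf{v}_c$ as

$$\begin{aligned}
\frac{\partial \log P(w_o \mid w_c)}{\partial \mathbf{v}_c}
&= \mathbf{u}_o - \frac{\sum_{j \in \mathcal{V}} \exp(\mathbf{u}_j^\top \mathbf{v}_c)\mathbf{u}_j}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}\\
&= \mathbf{u}_o - \sum_{j \in \mathcal{V}} \left(\frac{\exp(\mathbf{u}_j^\top \mathbf{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}\right) \mathbf{u}_j\\
&= \mathbf{u}_o - \sum_{j \in \mathcal{V}} P(w_j \mid w_c) \mathbf{u}_j.
\end{aligned} \tag{15.1.8}$$

Note that the calculation in (15.1.8) requires the conditional probabilities of all words in
the dictionary with $w_c$ as the center word. The gradients for the other word vectors can be
obtained in the same way.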
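

One way to gain confidence in (15.1.8) is to compare it against a finite-difference estimate of the gradient of (15.1.7). The sketch below does this in plain Python under assumed toy sizes and random vectors; `log_prob`, `analytic_grad`, and `numeric_grad` are hypothetical helper names.

```python
import math
import random

random.seed(0)
vocab_size, d = 5, 3  # toy sizes (hypothetical)
context_vecs = [[random.gauss(0, 1) for _ in range(d)]
                for _ in range(vocab_size)]  # u_i
v_c = [random.gauss(0, 1) for _ in range(d)]  # center word vector
o = 2  # index of the context word w_o

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def log_prob(v):
    """log P(w_o | w_c) as in (15.1.7), for center vector v."""
    scores = [dot(u, v) for u in context_vecs]
    max_score = max(scores)
    log_norm = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores))
    return scores[o] - log_norm

def analytic_grad(v):
    """Gradient from (15.1.8): u_o - sum_j P(w_j | w_c) u_j."""
    scores = [dot(u, v) for u in context_vecs]
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [context_vecs[o][k]
            - sum(p * u[k] for p, u in zip(probs, context_vecs))
            for k in range(d)]

def numeric_grad(v, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for k in range(d):
        plus, minus = list(v), list(v)
        plus[k] += eps
        minus[k] -= eps
        grad.append((log_prob(plus) - log_prob(minus)) / (2 * eps))
    return grad

max_gap = max(abs(a - n)
              for a, n in zip(analytic_grad(v_c), numeric_grad(v_c)))
print(f'largest difference: {max_gap:.2e}')
```

The two gradients should agree up to finite-difference error, which supports reading the last line of (15.1.8) as the context vector minus its expectation under the model.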


After training, for any word with index $i$ in the dictionary, we obtain both word vectors
$\mathbf{v}_i$ (as the center word) and $\mathbf{u}_i$ (as the context word). In natural language processing
applications, the center word vectors of the skip-gram model are typically used as the word
representations.


15.1.4 The Continuous Bag of Words (CBOW) Model 


The continuous bag of words (CBOW) model is similar to the skip-gram model. The major
difference from the skip-gram model is that the continuous bag of words model assumes that
a center word is generated based on its surrounding context words in the text sequence. For
example, in the same text sequence “the”, “man”, “loves”, “his”, and “son”, with “loves” as
the center word and the context window size being 2, the continuous bag of words model
considers the conditional probability of generating the center word “loves” based on the
context words “the”, “man”, “his” and “son” (as shown in Fig. 15.1.2), which is

$$P(\textrm{"loves"} \mid \textrm{"the"}, \textrm{"man"}, \textrm{"his"}, \textrm{"son"}). \tag{15.1.9}$$


Fig. 15.1.2: The continuous bag of words model considers the conditional probability of
generating the center word given its surrounding context words.


Since there are multiple context words in the continuous bag of words model, these context
word vectors are averaged in the calculation of the conditional probability. Specifically, for
any word with index $i$ in the dictionary, denote by $\mathbf{v}_i \in \mathbb{R}^d$ and $\mathbf{u}_i \in \mathbb{R}^d$ its two vectors
when used as a context word and a center word (meanings are switched in the skip-gram
model), respectively. The conditional probability of generating any center word $w_c$ (with
index $c$ in the dictionary) given its surrounding context words $w_{o_1}, \ldots, w_{o_{2m}}$ (with index
$o_1, \ldots, o_{2m}$ in the dictionary) can be modeled by

$$P(w_c \mid w_{o_1}, \ldots, w_{o_{2m}}) = \frac{\exp\left(\frac{1}{2m}\mathbf{u}_c^\top (\mathbf{v}_{o_1} + \ldots + \mathbf{v}_{o_{2m}})\right)}{\sum_{i \in \mathcal{V}} \exp\left(\frac{1}{2m}\mathbf{u}_i^\top (\mathbf{v}_{o_1} + \ldots + \mathbf{v}_{o_{2m}})\right)}. \tag{15.1.10}$$

For brevity, let $\mathcal{W}_o = \{w_{o_1}, \ldots, w_{o_{2m}}\}$ and $\bar{\mathbf{v}}_o = (\mathbf{v}_{o_1} + \ldots + \mathbf{v}_{o_{2m}})/(2m)$. Then (15.1.10)
can be simplified as

$$P(w_c \mid \mathcal{W}_o) = \frac{\exp\left(\mathbf{u}_c^\top \bar{\mathbf{v}}_o\right)}{\sum_{i \in \mathcal{V}} \exp\left(\mathbf{u}_i^\top \bar{\mathbf{v}}_o\right)}. \tag{15.1.11}$$

Consider a text sequence of length $T$, where the word at time step $t$ is denoted as $w^{(t)}$. For
context window size $m$, the likelihood function of the continuous bag of words model is
the probability of generating all center words given their context words:

$$\prod_{t=1}^{T} P(w^{(t)} \mid w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)}). \tag{15.1.12}$$
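

The averaging step in (15.1.11) can be sketched in plain Python as follows. As before, the sizes and random vectors are assumptions for illustration, and `cbow_probs` is a hypothetical helper name. Note the switched roles: here `context_vecs` holds the $\mathbf{v}_i$ and `center_vecs` holds the $\mathbf{u}_i$.

```python
import math
import random

random.seed(0)
vocab_size, d = 6, 4  # toy sizes (hypothetical)
context_vecs = [[random.gauss(0, 1) for _ in range(d)]
                for _ in range(vocab_size)]  # v_i in the CBOW notation
center_vecs = [[random.gauss(0, 1) for _ in range(d)]
               for _ in range(vocab_size)]  # u_i in the CBOW notation

def cbow_probs(context_indices):
    """Return [P(w_i | W_o) for every index i], following (15.1.11)."""
    n = len(context_indices)
    # Average the context word vectors to obtain v_bar
    v_bar = [sum(context_vecs[o][k] for o in context_indices) / n
             for k in range(d)]
    scores = [sum(a * b for a, b in zip(u, v_bar)) for u in center_vecs]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Context "the", "man", "his", "son" (indices 0, 1, 3, 4) around "loves" (2)
probs = cbow_probs([0, 1, 3, 4])
print(probs[2], sum(probs))
```

Because the context vectors are averaged, the order of the context words does not affect the result, which is where the "bag of words" in the model name comes from.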


Training 


Training continuous bag of words models is almost the same as training skip-gram models.
The maximum likelihood estimation of the continuous bag of words model is equivalent to
minimizing the following loss function:

$$-\sum_{t=1}^{T} \log P(w^{(t)} \mid w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)}). \tag{15.1.13}$$


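Notice that

$$\log P(w_c \mid \mathcal{W}_o) = \mathbf{u}_c^\top \bar{\mathbf{v}}_o - \log\left(\sum_{i \in \mathcal{V}} \exp\left(\mathbf{u}_i^\top \bar{\mathbf{v}}_o\right)\right). \tag{15.1.14}$$

Through differentiation, we can obtain its gradient with respect to any context word vector
$\mathbf{v}_{o_i}$ ($i = 1, \ldots, 2m$) as

$$\frac{\partial \log P(w_c \mid \mathcal{W}_o)}{\partial \mathbf{v}_{o_i}} = \frac{1}{2m}\left(\mathbf{u}_c - \sum_{j \in \mathcal{V}} \frac{\exp(\mathbf{u}_j^\top \bar{\mathbf{v}}_o)\mathbf{u}_j}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \bar{\mathbf{v}}_o)}\right) = \frac{1}{2m}\left(\mathbf{u}_c - \sum_{j \in \mathcal{V}} P(w_j \mid \mathcal{W}_o)\mathbf{u}_j\right). \tag{15.1.15}$$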


The gradients for the other word vectors can be obtained in the same way. Unlike the skip- 
gram model, the continuous bag of words model typically uses context word vectors as the 
word representations. 


15.1.5 Summary 


• Word vectors are vectors used to represent words, and can also be considered as feature
  vectors or representations of words. The technique of mapping words to real vectors
  is called word embedding.


• The word2vec tool contains both the skip-gram and continuous bag of words models.


• The skip-gram model assumes that a word can be used to generate its surrounding words
  in a text sequence, while the continuous bag of words model assumes that a center
  word is generated based on its surrounding context words.


15.1.6 Exercises 


1. What is the computational complexity for calculating each gradient? What could be the
issue if the dictionary size is huge?


2. Some fixed phrases in English consist of multiple words, such as “new york”. How can we
train their word vectors? Hint: see Section 4 in the word2vec paper (Mikolov et al.,
2013).


3. Let’s reflect on the word2vec design by taking the skip-gram model as an example. What 
is the relationship between the dot product of two word vectors in the skip-gram model 
and the cosine similarity? For a pair of words with similar semantics, why may the 
cosine similarity of their word vectors (trained by the skip-gram model) be high? 


Discussions.


15.2 Approximate Training 


Recall our discussions in Section 15.1. The main idea of the skip-gram model is using
softmax operations to calculate the conditional probability of generating a context word
$w_o$ based on the given center word $w_c$ in (15.1.4), whose corresponding logarithmic loss
is given by the negative of (15.1.7).


Due to the nature of the softmax operation, since a context word may be any word in the
dictionary $\mathcal{V}$, the negative of (15.1.7) contains a summation with as many terms as the
size of the vocabulary. Consequently, the gradient calculation for the skip-gram
model in (15.1.8) and that for the continuous bag of words model in (15.1.15) both contain
the summation. Unfortunately, the computational cost for such gradients that sum over a
large dictionary (often with hundreds of thousands or millions of words) is huge!


In order to reduce the aforementioned computational complexity, this section will introduce 
two approximate training methods: negative sampling and hierarchical softmax. Due to the 
similarity between the skip-gram model and the continuous bag of words model, we will 
just take the skip-gram model as an example to describe these two approximate training 
methods. 


15.2.1 Negative Sampling 


Negative sampling modifies the original objective function. Given the context window of
a center word $w_c$, the fact that any (context) word $w_o$ comes from this context window is
considered as an event with the probability modeled by

$$P(D=1 \mid w_c, w_o) = \sigma(\mathbf{u}_o^\top \mathbf{v}_c), \tag{15.2.1}$$

where $\sigma$ uses the definition of the sigmoid activation function:

$$\sigma(x) = \frac{1}{1+\exp(-x)}. \tag{15.2.2}$$


Let’s begin by maximizing the joint probability of all such events in text sequences to train
word embeddings. Specifically, given a text sequence of length $T$, denote by $w^{(t)}$ the word
at time step $t$ and let the context window size be $m$. Consider maximizing the joint
probability

$$\prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(D=1 \mid w^{(t)}, w^{(t+j)}). \tag{15.2.3}$$


However, (15.2.3) only considers those events that involve positive examples. As a result, 
the joint probability in (15.2.3) is maximized to 1 only if all the word vectors are equal 
to infinity. Of course, such results are meaningless. To make the objective function more 
meaningful, negative sampling adds negative examples sampled from a predefined distri- 
bution. 


Denote by $S$ the event that a context word $w_o$ comes from the context window of a
center word $w_c$. For this event involving $w_o$, from a predefined distribution $P(w)$ sample $K$
noise words that are not from this context window. Denote by $N_k$ the event that a noise
word $w_k$ ($k = 1, \ldots, K$) does not come from the context window of $w_c$. Assume that these
events involving both the positive example and negative examples $S, N_1, \ldots, N_K$ are
mutually independent. Negative sampling rewrites the joint probability (involving only positive
examples) in (15.2.3) as

$$\prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(w^{(t+j)} \mid w^{(t)}), \tag{15.2.4}$$

where the conditional probability is approximated through events $S, N_1, \ldots, N_K$:

$$P(w^{(t+j)} \mid w^{(t)}) = P(D=1 \mid w^{(t)}, w^{(t+j)}) \prod_{k=1,\ w_k \sim P(w)}^{K} P(D=0 \mid w^{(t)}, w_k). \tag{15.2.5}$$


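Denote by $i_t$ and $h_k$ the indices of a word $w^{(t)}$ at time step $t$ of a text sequence and a noise
word $w_k$, respectively. The logarithmic loss with respect to the conditional probabilities in
(15.2.5) is

$$\begin{aligned}
-\log P(w^{(t+j)} \mid w^{(t)})
&= -\log P(D=1 \mid w^{(t)}, w^{(t+j)}) - \sum_{k=1,\ w_k \sim P(w)}^{K} \log P(D=0 \mid w^{(t)}, w_k)\\
&= -\log \sigma\left(\mathbf{u}_{i_{t+j}}^\top \mathbf{v}_{i_t}\right) - \sum_{k=1,\ w_k \sim P(w)}^{K} \log\left(1 - \sigma\left(\mathbf{u}_{h_k}^\top \mathbf{v}_{i_t}\right)\right)\\
&= -\log \sigma\left(\mathbf{u}_{i_{t+j}}^\top \mathbf{v}_{i_t}\right) - \sum_{k=1,\ w_k \sim P(w)}^{K} \log \sigma\left(-\mathbf{u}_{h_k}^\top \mathbf{v}_{i_t}\right).
\end{aligned} \tag{15.2.6}$$

We can see that now the computational cost for gradients at each training step has nothing
to do with the dictionary size, but linearly depends on $K$. When setting the hyperparameter
$K$ to a smaller value, the computational cost for gradients at each training step with negative
sampling is smaller.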
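

The loss in (15.2.6) for a single center-context pair can be sketched in plain Python as below. The toy sizes, the random vectors, and in particular the uniform choice of noise words are assumptions for illustration: the text only requires that noise words come from some predefined distribution $P(w)$. The name `negative_sampling_loss` is hypothetical.

```python
import math
import random

random.seed(0)
vocab_size, d, K = 10, 4, 3  # toy sizes (hypothetical)
center_vecs = [[random.gauss(0, 1) for _ in range(d)]
               for _ in range(vocab_size)]  # v_i
context_vecs = [[random.gauss(0, 1) for _ in range(d)]
                for _ in range(vocab_size)]  # u_i

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def negative_sampling_loss(center, context, noise_words):
    """Loss (15.2.6) for one positive pair and its K noise words."""
    v = center_vecs[center]
    # Positive example: -log sigma(u_context^T v_center)
    loss = -math.log(sigmoid(dot(context_vecs[context], v)))
    # Negative examples: -log sigma(-u_noise^T v_center) for each noise word
    for h in noise_words:
        loss -= math.log(sigmoid(-dot(context_vecs[h], v)))
    return loss

center, context = 2, 3
# Assumption: draw noise words uniformly from all words except the context word
candidates = [w for w in range(vocab_size) if w != context]
noise_words = random.sample(candidates, K)
loss = negative_sampling_loss(center, context, noise_words)
print(noise_words, loss)
```

Only K + 1 dot products are needed per pair, regardless of the vocabulary size, which is the source of the savings described above.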


15.2.2 Hierarchical Softmax


As an alternative approximate training method, hierarchical softmax uses the binary tree, a
data structure illustrated in Fig. 15.2.1, where each leaf node of the tree represents a word
in dictionary $\mathcal{V}$.


Fig. 15.2.1: Hierarchical softmax for approximate training, where each leaf node of the tree
represents a word in the dictionary.


Denote by $L(w)$ the number of nodes (including both ends) on the path from the root node
to the leaf node representing word $w$ in the binary tree. Let $n(w, j)$ be the $j^{\textrm{th}}$ node on this
path, with its context word vector being $\mathbf{u}_{n(w, j)}$. For example, $L(w_3) = 4$ in Fig. 15.2.1.
Hierarchical softmax approximates the conditional probability in (15.1.4) as

$$P(w_o \mid w_c) = \prod_{j=1}^{L(w_o)-1} \sigma\left([\![ n(w_o, j+1) = \textrm{leftChild}(n(w_o, j)) ]\!] \cdot \mathbf{u}_{n(w_o, j)}^\top \mathbf{v}_c\right), \tag{15.2.7}$$

where function $\sigma$ is defined in (15.2.2), and $\textrm{leftChild}(n)$ is the left child node of node $n$:
if $x$ is true, $[\![x]\!] = 1$; otherwise $[\![x]\!] = -1$.


To illustrate, let’s calculate the conditional probability of generating word $w_3$ given word
$w_c$ in Fig. 15.2.1. This requires dot products between the word vector $\mathbf{v}_c$ of $w_c$ and
non-leaf node vectors on the path (the path in bold in Fig. 15.2.1) from the root to $w_3$, which is
traversed left, right, then left:

$$P(w_3 \mid w_c) = \sigma(\mathbf{u}_{n(w_3, 1)}^\top \mathbf{v}_c) \cdot \sigma(-\mathbf{u}_{n(w_3, 2)}^\top \mathbf{v}_c) \cdot \sigma(\mathbf{u}_{n(w_3, 3)}^\top \mathbf{v}_c). \tag{15.2.8}$$

Since $\sigma(x) + \sigma(-x) = 1$, it holds that the conditional probabilities of generating all the
words in dictionary $\mathcal{V}$ based on any word $w_c$ sum up to one:

$$\sum_{w \in \mathcal{V}} P(w \mid w_c) = 1. \tag{15.2.9}$$

Fortunately, since $L(w_o) - 1$ is on the order of $\mathcal{O}(\log |\mathcal{V}|)$ due to the binary tree
structure, when the dictionary size $|\mathcal{V}|$ is huge, the computational cost for each training step
using hierarchical softmax is significantly reduced compared with that without approximate
training.
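

The product in (15.2.7) can be sketched on a small tree. The sketch assumes a complete binary tree with 8 leaves stored in heap order (root at index 1, children of node n at 2n and 2n + 1), which is one convenient layout and not necessarily the tree of Fig. 15.2.1; the random vectors and the name `hs_prob` are likewise hypothetical.

```python
import math
import random

random.seed(0)
depth, d = 3, 4              # every leaf has L(w) = depth + 1 = 4 path nodes
num_leaves = 2 ** depth      # toy dictionary of 8 words (hypothetical)
# One vector per non-leaf node, heap-indexed 1..num_leaves-1 (index 0 unused)
node_vecs = [[random.gauss(0, 1) for _ in range(d)]
             for _ in range(num_leaves)]
v_c = [random.gauss(0, 1) for _ in range(d)]  # center word vector

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def hs_prob(word):
    """P(word | w_c) as the product of L(w) - 1 sigmoids in (15.2.7)."""
    node, prob = 1, 1.0  # start at the root
    for level in reversed(range(depth)):
        go_right = (word >> level) & 1    # bit of `word` picks the branch
        sign = -1.0 if go_right else 1.0  # +1 for a left child, -1 otherwise
        prob *= sigmoid(sign * dot(node_vecs[node], v_c))
        node = 2 * node + go_right        # descend to the chosen child
    return prob

total = sum(hs_prob(w) for w in range(num_leaves))
print(total)
```

Each probability uses only `depth` dot products instead of one per word, and the probabilities over all leaves still sum to one, in line with (15.2.9).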


15.2.3 Summary


• Negative sampling constructs the loss function by considering mutually independent
  events that involve both positive and negative examples. The computational cost for
  training is linearly dependent on the number of noise words at each step.


• Hierarchical softmax constructs the loss function using the path from the root node to
  the leaf node in the binary tree. The computational cost for training is dependent on
  the logarithm of the dictionary size at each step.


15.2.4 Exercises 


1. How can we sample noise words in negative sampling?


2. Verify that (15.2.9) holds.


3. How can we train the continuous bag of words model using negative sampling and
hierarchical softmax, respectively?


Discussions.


15.3 The Dataset for Pretraining Word Embeddings 


Now that we know the technical details of the word2vec models and approximate training 
methods, let’s walk through their implementations. Specifically, we will take the skip- 
gram model in Section 15.1 and negative sampling in Section 15.2 as an example. In this 
section, we begin with the dataset for pretraining the word embedding model: the original 
format of the data will be transformed into minibatches that can be iterated over during 
training. 


import collections
import math
import os
import random
import torch
from d2l import torch as d2l


15.3.1 Reading the Dataset 


The dataset that we use here is Penn Tree Bank (PTB). This corpus is sampled from Wall
Street Journal articles, split into training, validation, and test sets. In the original format,
each line of the text file represents a sentence of words that are separated by spaces. Here
we treat each word as a token.


#@save
d2l.DATA_HUB['ptb'] = (d2l.DATA_URL + 'ptb.zip',
                       '319d85e578af0cdc590547f26231e4e31cdf1e42')

#@save
def read_ptb():
    """Load the PTB dataset into a list of text lines."""
    data_dir = d2l.download_extract('ptb')
    # Read the training set
    with open(os.path.join(data_dir, 'ptb.train.txt')) as f:
        raw_text = f.read()
    return [line.split() for line in raw_text.split('\n')]

sentences = read_ptb()
f'# sentences: {len(sentences)}'


Downloading ../data/ptb.zip from http://d2l-data.s3-accelerate.amazonaws.com/ptb.zip...


'# sentences: 42069'


After reading the training set, we build a vocabulary for the corpus, where any word that 
appears less than 10 times is replaced by the “<unk>” token. Note that the original dataset 
also contains “<unk>” tokens that represent rare (unknown) words. 


vocab = d2l.Vocab(sentences, min_freq=10)
f'vocab size: {len(vocab)}'


'vocab size: 6719'


15.3.2 Subsampling 




Text data typically have high-frequency words such as “the”, “a”, and “in”: they may even
occur billions of times in very large corpora. However, these words often co-occur with
many different words in context windows, providing little useful signal. For instance,
consider the word “chip” in a context window: intuitively its co-occurrence with a
low-frequency word “intel” is more useful in training than the co-occurrence with a
high-frequency word “a”. Moreover, training with vast amounts of (high-frequency) words is
slow. Thus, when training word embedding models, high-frequency words can be
subsampled (Mikolov et al., 2013). Specifically, each indexed word $w_i$ in the dataset will be
discarded with probability

$$P(w_i) = \max\left(1 - \sqrt{\frac{t}{f(w_i)}}, 0\right), \tag{15.3.1}$$

where $f(w_i)$ is the ratio of the number of words $w_i$ to the total number of words in the
dataset, and the constant $t$ is a hyperparameter ($10^{-4}$ in the experiment). We can see that
only when the relative frequency $f(w_i) > t$ can the (high-frequency) word $w_i$ be discarded,
and the higher the relative frequency of the word, the greater the probability of being
discarded.
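

To get a feel for (15.3.1), the short sketch below evaluates the discard probability for a few relative frequencies. The frequencies are made-up illustrative values, not statistics of the PTB dataset, and `discard_prob` is a hypothetical helper name.

```python
import math

def discard_prob(rel_freq, t=1e-4):
    """Probability of discarding a word with relative frequency `rel_freq`,
    following (15.3.1)."""
    return max(1 - math.sqrt(t / rel_freq), 0)

# Hypothetical relative frequencies, from very common to rare
rel_freqs = (1e-2, 1e-3, 1e-4, 1e-5)
discard_probs = [discard_prob(f) for f in rel_freqs]
for f, p in zip(rel_freqs, discard_probs):
    print(f'f(w) = {f:.0e}: discard with probability {p:.3f}')
```

A word that makes up 1% of all tokens is dropped about 90% of the time under this rule, while any word whose relative frequency is at most t is always kept.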


#@save
def subsample(sentences, vocab):
    """Subsample high-frequency words."""
    # Exclude unknown tokens ('<unk>')
    sentences = [[token for token in line if vocab[token] != vocab.unk]
                 for line in sentences]
    counter = collections.Counter([
        token for line in sentences for token in line])
    num_tokens = sum(counter.values())

    # Return True if `token` is kept during subsampling
    def keep(token):
        return(random.uniform(0, 1) <
               math.sqrt(1e-4 / counter[token] * num_tokens))

    return ([[token for token in line if keep(token)] for line in sentences],
            counter)

subsampled, counter = subsample(sentences, vocab)


The following code snippet plots the histogram of the number of tokens per sentence
before and after subsampling. As expected, subsampling significantly shortens sentences by
dropping high-frequency words, which will lead to training speedup.


d2l.show_list_len_pair_hist(['origin', 'subsampled'], '# tokens per sentence',
                            'count', sentences, subsampled);


[Figure: histogram of the number of tokens per sentence (x-axis: # tokens per sentence;
y-axis: count) for the original and the subsampled sentences.]


For individual tokens, the sampling rate of the high-frequency word “the” is less than 
1/20. 


def compare_counts(token):
    return (f'# of "{token}": '
            f'before={sum([l.count(token) for l in sentences])}, '
            f'after={sum([l.count(token) for l in subsampled])}')

compare_counts('the')


'# of "the": before=50770, after=2010'


In contrast, the low-frequency word “join” is completely kept.


compare_counts('join')


'# of "join": before=45, after=45'


After subsampling, we map tokens to their indices for the corpus. 


corpus = [vocab[line] for line in subsampled]
corpus[:3]


[[], [4127, 3228, 1773], [3922, 1922, 4743, 2696]]


15.3.3 Extracting Center Words and Context Words 


The following get_centers_and_contexts function extracts all the center words and their
context words from corpus. It uniformly samples an integer between 1 and max_window_size
at random as the context window size. For any center word, those words whose distance
from it does not exceed the sampled context window size are its context words.


#@save
def get_centers_and_contexts(corpus, max_window_size):
    """Return center words and context words in skip-gram."""
    centers, contexts = [], []
    for line in corpus:
        # To form a "center word--context word" pair, each sentence needs to
        # have at least 2 words
        if len(line) < 2:
            continue
        centers += line
        for i in range(len(line)):  # Context window centered at `i`
            window_size = random.randint(1, max_window_size)
            indices = list(range(max(0, i - window_size),
                                 min(len(line), i + 1 + window_size)))
            # Exclude the center word from the context words
            indices.remove(i)
            contexts.append([line[idx] for idx in indices])
    return centers, contexts


Next, we create an artificial dataset containing two sentences of 7 and 3 words, respectively.
Let the maximum context window size be 2 and print all the center words and their context
words.


tiny_dataset = [list(range(7)), list(range(7, 10))]
print('dataset', tiny_dataset)
for center, context in zip(*get_centers_and_contexts(tiny_dataset, 2)):
    print('center', center, 'has contexts', context)


dataset [[0, 1, 2, 3, 4, 5, 6], [7, 8, 9]]
center 0 has contexts [1]
center 1 has contexts [0, 2]
center 2 has contexts [0, 1, 3, 4]
center 3 has contexts [1, 2, 4, 5]
center 4 has contexts [2, 3, 5, 6]
center 5 has contexts [3, 4, 6]
center 6 has contexts [5]
center 7 has contexts [8, 9]
center 8 has contexts [7, 9]
center 9 has contexts [7, 8]


When training on the PTB dataset, we set the maximum context window size to 5. The 
following extracts all the center words and their context words in the dataset. 


all_centers, all_contexts = get_centers_and_contexts(corpus, 5) 
f'# center-context pairs: {sum([len(contexts) for contexts in all_contexts])}' 


'# center-context pairs: 1503420'


15.3.4 Negative Sampling 


We use negative sampling for approximate training. To sample noise words according to a
predefined distribution, we define the following RandomGenerator class, where the (possibly
unnormalized) sampling distribution is passed via the argument sampling_weights.


#@save
class RandomGenerator:
    """Randomly draw among {1, ..., n} according to n sampling weights."""
    def __init__(self, sampling_weights):
        # Exclude index 0 (the unknown token): the population starts at 1
        self.population = list(range(1, len(sampling_weights) + 1))
        self.sampling_weights = sampling_weights
        self.candidates = []
        self.i = 0

    def draw(self):
        if self.i == len(self.candidates):
            # Cache `k` random sampling results
            self.candidates = random.choices(
                self.population, self.sampling_weights, k=10000)
            self.i = 0
        self.i += 1
        return self.candidates[self.i - 1]


For example, we can draw 10 random variables X among indices 1, 2, and 3 with sampling
probabilities P(X = 1) = 2/9, P(X = 2) = 3/9, and P(X = 3) = 4/9 as follows.
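
One way to run such a draw is sketched below, assuming the unnormalized weights [2, 3, 4]
are passed as sampling_weights (they normalize to 2/9, 3/9, and 4/9). The class is repeated
only so that the sketch runs on its own; the names generator and draws are illustrative, and
the printed values differ from run to run because they are random.

```python
import random

class RandomGenerator:
    """Same logic as the class above, repeated so the sketch runs alone."""
    def __init__(self, sampling_weights):
        self.population = list(range(1, len(sampling_weights) + 1))
        self.sampling_weights = sampling_weights
        self.candidates = []
        self.i = 0

    def draw(self):
        if self.i == len(self.candidates):
            # Refill the cache with 10000 weighted draws
            self.candidates = random.choices(
                self.population, self.sampling_weights, k=10000)
            self.i = 0
        self.i += 1
        return self.candidates[self.i - 1]

# Unnormalized weights 2, 3, 4 for the indices 1, 2, 3
generator = RandomGenerator([2, 3, 4])
draws = [generator.draw() for _ in range(10)]
print(draws)
```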


For a pair of center word and context word, we randomly sample K (5 in the experiment) 
noise words. According to the suggestions in the word2vec paper, the sampling probability 
P(w) of a noise word w is set to its relative frequency in the dictionary raised to the power 
of 0.75 (Mikolov et al., 2013). 
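
To see what the power of 0.75 does, the following minimal sketch compares the raw relative
frequencies with the smoothed ones. The token counts and the helper normalize are
hypothetical and serve only as an illustration: raising counts to the power of 0.75 moves
probability mass from frequent words to rare ones.

```python
# Hypothetical token counts, for illustration only
counts = {'the': 1000, 'chip': 100, 'join': 10}

def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}

raw = normalize(counts)
smoothed = normalize({token: c ** 0.75 for token, c in counts.items()})
for token in counts:
    print(f'{token}: raw={raw[token]:.3f}, smoothed={smoothed[token]:.3f}')
```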


#@save
def get_negatives(all_contexts, vocab, counter, K):
    """Return noise words in negative sampling."""
    # Sampling weights for words with indices 1, 2, ... (index 0 is the
    # excluded unknown token) in the vocabulary
    sampling_weights = [counter[vocab.to_tokens(i)]**0.75
                        for i in range(1, len(vocab))]
    all_negatives, generator = [], RandomGenerator(sampling_weights)
    for contexts in all_contexts:
        negatives = []
        while len(negatives) < len(contexts) * K:
            neg = generator.draw()
            # Noise words cannot be context words
            if neg not in contexts:
                negatives.append(neg)
        all_negatives.append(negatives)
    return all_negatives

all_negatives = get_negatives(all_contexts, vocab, counter, 5)


15.3.5 Loading Training Examples in Minibatches 


After all the center words together with their context words and sampled noise words are 
extracted, they will be transformed into minibatches of examples that can be iteratively 
loaded during training. 


In a minibatch, the $i^\textrm{th}$ example includes a center word and its $n_i$ context
words and $m_i$ noise words. Due to varying context window sizes, $n_i+m_i$ varies for
different $i$. Thus, for each example we concatenate its context words and noise words in
the contexts_negatives variable, and pad zeros until the concatenation length reaches
$\max_i n_i+m_i$ (max_len). To exclude paddings in the calculation of the loss, we define a
mask variable masks. There is a one-to-one correspondence between elements in masks and
elements in contexts_negatives, where zeros (otherwise ones) in masks correspond to paddings
in contexts_negatives.


To distinguish between positive and negative examples, we separate context words from
noise words in contexts_negatives via a labels variable. Similar to masks, there is also
a one-to-one correspondence between elements in labels and elements in contexts_negatives,
where ones (otherwise zeros) in labels correspond to context words (positive examples)
in contexts_negatives.


The above idea is implemented in the following batchify function. Its input data is a
list with length equal to the batch size, where each element is an example consisting of
the center word center, its context words context, and its noise words negative. This
function returns a minibatch that can be loaded for calculations during training,
including the mask variable.


#@save
def batchify(data):
    """Return a minibatch of examples for skip-gram with negative sampling."""
    max_len = max(len(c) + len(n) for _, c, n in data)
    centers, contexts_negatives, masks, labels = [], [], [], []
    for center, context, negative in data:
        cur_len = len(context) + len(negative)
        centers += [center]
        contexts_negatives += [context + negative + [0] * (max_len - cur_len)]
        masks += [[1] * cur_len + [0] * (max_len - cur_len)]
        labels += [[1] * len(context) + [0] * (max_len - len(context))]
    return (torch.tensor(centers).reshape((-1, 1)), torch.tensor(
        contexts_negatives), torch.tensor(masks), torch.tensor(labels))


Let’s test this function using a minibatch of two examples. 


x_1 = (1, [2, 2], [3, 3, 3, 3])
x_2 = (1, [2, 2, 2], [3, 3])
batch = batchify((x_1, x_2))

names = ['centers', 'contexts_negatives', 'masks', 'labels']
for name, data in zip(names, batch):
    print(name, '=', data)


centers = tensor([[1],
        [1]])
contexts_negatives = tensor([[2, 2, 3, 3, 3, 3],
        [2, 2, 2, 3, 3, 0]])
masks = tensor([[1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0]])
labels = tensor([[1, 1, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0]])


15.3.6 Putting It All Together 


Last, we define the load_data_ptb function that reads the PTB dataset and returns the data 
iterator and the vocabulary. 


#@save
def load_data_ptb(batch_size, max_window_size, num_noise_words):
    """Download the PTB dataset and then load it into memory."""
    num_workers = d2l.get_dataloader_workers()
    sentences = read_ptb()
    vocab = d2l.Vocab(sentences, min_freq=10)
    subsampled, counter = subsample(sentences, vocab)
    corpus = [vocab[line] for line in subsampled]
    all_centers, all_contexts = get_centers_and_contexts(
        corpus, max_window_size)
    all_negatives = get_negatives(
        all_contexts, vocab, counter, num_noise_words)

    class PTBDataset(torch.utils.data.Dataset):
        def __init__(self, centers, contexts, negatives):
            assert len(centers) == len(contexts) == len(negatives)
            self.centers = centers
            self.contexts = contexts
            self.negatives = negatives

        def __getitem__(self, index):
            return (self.centers[index], self.contexts[index],
                    self.negatives[index])

        def __len__(self):
            return len(self.centers)

    dataset = PTBDataset(all_centers, all_contexts, all_negatives)

    data_iter = torch.utils.data.DataLoader(dataset, batch_size, shuffle=True,
                                            collate_fn=batchify,
                                            num_workers=num_workers)
    return data_iter, vocab


Let’s print the first minibatch of the data iterator. 


data_iter, vocab = load_data_ptb(512, 5, 5)
for batch in data_iter:
    for name, data in zip(names, batch):
        print(name, 'shape:', data.shape)
    break


centers shape: torch.Size([512, 1]) 
contexts_negatives shape: torch.Size([512, 60]) 
masks shape: torch.Size([512, 60]) 
labels shape: torch.Size([512, 60]) 
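
The width of 60 is consistent with the settings used here. With a maximum window size of 5,
a center word has at most 10 context words, and get_negatives draws 5 noise words for each
of them. A small sketch of this arithmetic follows; the helper name max_len_upper_bound is
hypothetical. Note that max_len in a particular minibatch is the largest length that
actually occurs in it, so it can be smaller than this bound.

```python
def max_len_upper_bound(max_window_size, num_noise_words):
    """Largest possible number of context plus noise words for one example."""
    # At most `max_window_size` context words on each side of the center word
    max_contexts = 2 * max_window_size
    # `num_noise_words` noise words are drawn for every context word
    max_negatives = max_contexts * num_noise_words
    return max_contexts + max_negatives

bound = max_len_upper_bound(5, 5)
print(bound)  # 60
```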


15.3.7 Summary 


• High-frequency words may not be so useful in training. We can subsample them for
speedup in training.


• For computational efficiency, we load examples in minibatches. We can define other
variables to distinguish paddings from non-paddings, and positive examples from
negative ones.


15.3.8 Exercises 


1. How does the running time of the code in this section change if subsampling is not used?


2. The RandomGenerator class caches k random sampling results. Set k to other values
and see how it affects the data loading speed.


3. What other hyperparameters in the code of this section may affect the data loading 
speed? 


Discussions


15.4 Pretraining word2vec 


We go on to implement the skip-gram model defined in Section 15.1. Then we will pretrain
word2vec using negative sampling on the PTB dataset. First of all, let’s obtain the data
iterator and the vocabulary for this dataset by calling the d2l.load_data_ptb function,
which was described in Section 15.3.


import math
import torch
from torch import nn
from d2l import torch as d2l

batch_size, max_window_size, num_noise_words = 512, 5, 5
data_iter, vocab = d2l.load_data_ptb(batch_size, max_window_size,
                                     num_noise_words)


15.4.1 The Skip-Gram Model 


We implement the skip-gram model by using embedding layers and batch matrix
multiplications. First, let’s review how embedding layers work.


Embedding Layer 


As described in Section 10.7, an embedding layer maps a token’s index to its feature
vector. The weight of this layer is a matrix whose number of rows equals the dictionary
size (input_dim) and whose number of columns equals the vector dimension for each
token (output_dim). After a word embedding model is trained, this weight is what we
need.


embed = nn.Embedding(num_embeddings=20, embedding_dim=4)
print(f'Parameter embedding_weight ({embed.weight.shape}, '
      f'dtype={embed.weight.dtype})')


Parameter embedding_weight (torch.Size([20, 4]), dtype=torch.float32)


The input of an embedding layer is the index of a token (word). For any token index $i$, its
vector representation can be obtained from the $i^\textrm{th}$ row of the weight matrix in
the embedding layer. Since the vector dimension (output_dim) was set to 4, the embedding
layer returns vectors with shape (2, 3, 4) for a minibatch of token indices with shape (2, 3).


x = torch.tensor([[1, 2, 3], [4, 5, 6]])
embed(x)


tensor([[[ 0.7606,  0.3872, -0.1864,  1.1732],
         [ 1.5035,  2.3623, -1.7542, -1.4990],
         [-1.2639, -1.5313,  2.1719,  0.4151]],

        [[-1.9079,  0.2434,  1.5395,  1.2990],
         [ 0.7470,  1.0129,  0.4039,  0.0591],
         [-0.6293, -0.1814, -0.4782, -0.5289]]], grad_fn=<EmbeddingBackward0>)
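
Conceptually, the lookup above is plain row indexing into the weight matrix. The following
sketch mimics it with nested Python lists instead of tensors. The names weight and
embed_lookup and the random values are stand-ins for illustration and are not part of the
PyTorch API.

```python
import random

random.seed(0)
num_embeddings, embedding_dim = 20, 4
# Stand-in for `embed.weight`: one row of length `embedding_dim` per token index
weight = [[random.gauss(0, 1) for _ in range(embedding_dim)]
          for _ in range(num_embeddings)]

def embed_lookup(indices):
    """Replace every token index in a 2D list by its row of `weight`."""
    return [[weight[idx] for idx in row] for row in indices]

x = [[1, 2, 3], [4, 5, 6]]
out = embed_lookup(x)
# A (2, 3) input of indices yields a (2, 3, 4) output of vectors
print(len(out), len(out[0]), len(out[0][0]))  # 2 3 4
```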


Defining the Forward Propagation 


In the forward propagation, the input of the skip-gram model includes the center word 
indices center of shape (batch size, 1) and the concatenated context and noise word indices 
contexts_and_negatives of shape (batch size, max_len), where max_len is defined in 
Section 15.3.5. These two variables are first transformed from the token indices into vectors 
via the embedding layer, then their batch matrix multiplication (described in Section 11.3.2) 
returns an output of shape (batch size, 1, max_len). Each element in the output is the dot 
product of a center word vector and a context or noise word vector. 


def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    v = embed_v(center)
    u = embed_u(contexts_and_negatives)
    pred = torch.bmm(v, u.permute(0, 2, 1))
    return pred


Let’s print the output shape of this skip_gram function for some example inputs. 


skip_gram(torch.ones((2, 1), dtype=torch.long),
          torch.ones((2, 4), dtype=torch.long), embed, embed).shape


torch.Size([2, 1, 4]) 
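
To make the meaning of each output element concrete, here is a minimal sketch of the same
computation with nested lists in place of tensors. The helper names dot and
skip_gram_scores and the example vectors are hypothetical; the sketch only illustrates that
entry (b, 0, j) is the dot product of the center word vector of example b with its j-th
context or noise word vector.

```python
def dot(a, b):
    """Dot product of two equally long lists of numbers."""
    return sum(x * y for x, y in zip(a, b))

def skip_gram_scores(v, u):
    """v has shape (batch, 1, d); u has shape (batch, max_len, d).

    Returns nested lists of shape (batch, 1, max_len)."""
    return [[[dot(v_b[0], u_bj) for u_bj in u_b]] for v_b, u_b in zip(v, u)]

# Hypothetical 2-dimensional vectors: batch size 1, max_len 3
v = [[[1.0, 2.0]]]
u = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
scores = skip_gram_scores(v, u)
print(scores)  # [[[1.0, 2.0, 3.0]]]
```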


15.4.2 Training 


Before training the skip-gram model with negative sampling, let’s first define its loss
function.


Binary Cross-Entropy Loss 


According to the definition of the loss function for negative sampling in Section 15.2.1, we
will use the binary cross-entropy loss.


class SigmoidBCELoss(nn.Module):
    # Binary cross-entropy loss with masking
    def __init__(self):
        super().__init__()

    def forward(self, inputs, target, mask=None):
        out = nn.functional.binary_cross_entropy_with_logits(
            inputs, target, weight=mask, reduction="none")
        return out.mean(dim=1)

loss = SigmoidBCELoss()


Recall our descriptions of the mask variable and the label variable in Section 15.3.5. The 
following calculates the binary cross-entropy loss for the given variables. 


pred = torch.tensor([[1.1, -2.2, 3.3, -4.4]] * 2)
label = torch.tensor([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
loss(pred, label, mask) * mask.shape[1] / mask.sum(axis=1)


tensor([0.9352, 1.8462])


The following shows how the above results are calculated (in a less efficient way) using the
sigmoid activation function in the binary cross-entropy loss. We can consider the two outputs
as two normalized losses that are averaged over non-masked predictions.


def sigmd(x):
    return -math.log(1 / (1 + math.exp(-x)))

print(f'{(sigmd(1.1) + sigmd(2.2) + sigmd(-3.3) + sigmd(4.4)) / 4:.4f}')
print(f'{(sigmd(-1.1) + sigmd(-2.2)) / 2:.4f}')


0.9352
1.8462


Initializing Model Parameters 


We define two embedding layers for all the words in the vocabulary when they are used as 
center words and context words, respectively. The word vector dimension embed_size is 
set to 100. 


embed_size = 100
net = nn.Sequential(nn.Embedding(num_embeddings=len(vocab),
                                 embedding_dim=embed_size),
                    nn.Embedding(num_embeddings=len(vocab),
                                 embedding_dim=embed_size))


Defining the Training Loop


The training loop is defined below. Because of the existence of padding, the calculation of 
the loss function is slightly different compared to the previous training functions. 


def train(net, data_iter, lr, num_epochs, device=d2l.try_gpu()):
    def init_weights(module):
        if type(module) == nn.Embedding:
            nn.init.xavier_uniform_(module.weight)
    net.apply(init_weights)
    net = net.to(device)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[1, num_epochs])
    # Sum of normalized losses, no. of normalized losses
    metric = d2l.Accumulator(2)
    for epoch in range(num_epochs):
        timer, num_batches = d2l.Timer(), len(data_iter)
        for i, batch in enumerate(data_iter):
            optimizer.zero_grad()
            center, context_negative, mask, label = [
                data.to(device) for data in batch]

            pred = skip_gram(center, context_negative, net[0], net[1])
            l = (loss(pred.reshape(label.shape).float(), label.float(), mask)
                 / mask.sum(axis=1) * mask.shape[1])
            l.sum().backward()
            optimizer.step()
            metric.add(l.sum(), l.numel())
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (metric[0] / metric[1],))
    print(f'loss {metric[0] / metric[1]:.3f}, '
          f'{metric[1] / timer.stop():.1f} tokens/sec on {str(device)}')


Now we can train a skip-gram model using negative sampling. 


lr, num_epochs = 0.002, 5
train(net, data_iter, lr, num_epochs)


loss 0.410, 223485.0 tokens/sec on cuda:0


15.4.3 Applying Word Embeddings


After training the word2vec model, we can use the cosine similarity of word vectors from 
the trained model to find words from the dictionary that are most semantically similar to 
an input word. 


def get_similar_tokens(query_token, k, embed):
    W = embed.weight.data
    x = W[vocab[query_token]]
    # Compute the cosine similarity. Add 1e-9 for numerical stability
    cos = torch.mv(W, x) / torch.sqrt(torch.sum(W * W, dim=1) *
                                      torch.sum(x * x) + 1e-9)
    topk = torch.topk(cos, k=k+1)[1].cpu().numpy().astype('int32')
    for i in topk[1:]:  # Remove the input words
        print(f'cosine sim={float(cos[i]):.3f}: {vocab.to_tokens(i)}')

get_similar_tokens('chip', 3, net[0])


cosine sim=0.702: microprocessor 
cosine sim=0.649: mips 
cosine sim=0.643: intel 


15.4.4 Summary 


• We can train a skip-gram model with negative sampling using embedding layers and the
binary cross-entropy loss.


• Applications of word embeddings include finding semantically similar words for a given
word based on the cosine similarity of word vectors.


15.4.5 Exercises 


1. Using the trained model, find semantically similar words for other input words. Can you 
improve the results by tuning hyperparameters? 


2. When a training corpus is huge, we often sample context words and noise words for 
the center words in the current minibatch when updating model parameters. In other 
words, the same center word may have different context words or noise words in different 
training epochs. What are the benefits of this method? Try to implement this training 
method. 


Discussions


15.5 Word Embedding with Global Vectors (GloVe) 


Word-word co-occurrences within context windows may carry rich semantic information.
For example, in a large corpus the word “solid” is more likely to co-occur with “ice” than
“steam”, but the word “gas” probably co-occurs with “steam” more frequently than “ice”.
Besides, global corpus statistics of such co-occurrences can be precomputed: this can lead
to more efficient training. To leverage statistical information in the entire corpus for word
embedding, let’s first revisit the skip-gram model in Section 15.1.3, but interpret it using
global corpus statistics such as co-occurrence counts.


15.5.1 Skip-Gram with Global Corpus Statistics 


Denoting by $q_{ij}$ the conditional probability $P(w_j \mid w_i)$ of word $w_j$ given word
$w_i$ in the skip-gram model, we have

$$q_{ij} = \frac{\exp(\mathbf{u}_j^\top \mathbf{v}_i)}{\sum_{k \in \mathcal{V}} \exp(\mathbf{u}_k^\top \mathbf{v}_i)}, \tag{15.5.1}$$

where for any index $i$ vectors $\mathbf{v}_i$ and $\mathbf{u}_i$ represent word $w_i$ as the
center word and context word, respectively, and $\mathcal{V} = \{0, 1, \ldots, |\mathcal{V}|-1\}$
is the index set of the vocabulary.


Consider word $w_i$ that may occur multiple times in the corpus. In the entire corpus, all the
context words wherever $w_i$ is taken as their center word form a multiset $\mathcal{C}_i$ of
word indices that allows for multiple instances of the same element. For any element, its
number of instances is called its multiplicity. To illustrate with an example, suppose that
word $w_i$ occurs twice in the corpus and indices of the context words that take $w_i$ as
their center word in the two context windows are $k, j, m, k$ and $k, l, k, j$. Thus,
multiset $\mathcal{C}_i = \{j, j, k, k, k, k, l, m\}$, where multiplicities of elements
$j, k, l, m$ are 2, 4, 1, 1, respectively.
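
The multiset in this example can be reproduced with a counter. This is only a sketch: the
index names are kept as strings, and the variable names window_1, window_2, and C_i are
chosen for illustration.

```python
import collections

# Context-word indices in the two windows where the word is the center word
window_1 = ['k', 'j', 'm', 'k']
window_2 = ['k', 'l', 'k', 'j']

# The counter stores each element of the multiset with its multiplicity
C_i = collections.Counter(window_1 + window_2)
print(sorted(C_i.items()))  # [('j', 2), ('k', 4), ('l', 1), ('m', 1)]
print(sum(C_i.values()))    # 8, the size of the multiset
```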


Now let’s denote the multiplicity of element $j$ in multiset $\mathcal{C}_i$ as $x_{ij}$. This
is the global co-occurrence count of word $w_j$ (as the context word) and word $w_i$ (as the
center word) in the same context window in the entire corpus. Using such global corpus
statistics, the loss function of the skip-gram model is equivalent to

$$-\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{V}} x_{ij} \log\,q_{ij}. \tag{15.5.2}$$

We further denote by $x_i$ the number of all the context words in the context windows where
$w_i$ occurs as their center word, which is equivalent to $|\mathcal{C}_i|$. Letting $p_{ij}$
be the conditional probability $x_{ij}/x_i$ for generating context word $w_j$ given center
word $w_i$, (15.5.2) can be rewritten as

$$-\sum_{i\in\mathcal{V}} x_i \sum_{j\in\mathcal{V}} p_{ij} \log\,q_{ij}. \tag{15.5.3}$$


In (15.5.3), $-\sum_{j\in\mathcal{V}} p_{ij} \log\,q_{ij}$ calculates the cross-entropy of the
conditional distribution $p_{ij}$ of global corpus statistics and the conditional distribution
$q_{ij}$ of model predictions. This loss is also weighted by $x_i$ as explained above.
Minimizing the loss function in (15.5.3) will allow the predicted conditional distribution to
get close to the conditional distribution from the global corpus statistics.
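
The equivalence of the two forms of the loss can be checked numerically. The sketch below
uses a small table of hypothetical co-occurrence counts x and hypothetical model
probabilities q (each row of q sums to 1); both are invented for illustration.

```python
import math

# Hypothetical co-occurrence counts x[i][j] and model probabilities q[i][j]
x = [[0, 2, 1], [2, 0, 4], [1, 4, 0]]
q = [[0.2, 0.5, 0.3], [0.3, 0.1, 0.6], [0.25, 0.7, 0.05]]

# (15.5.2): sum of -x_ij * log q_ij over all pairs (i, j)
loss_counts = -sum(x[i][j] * math.log(q[i][j])
                   for i in range(3) for j in range(3))

# (15.5.3): cross-entropy of each row, weighted by x_i = sum_j x_ij
loss_weighted = 0.0
for i in range(3):
    x_i = sum(x[i])
    p_i = [x_ij / x_i for x_ij in x[i]]
    cross_entropy = -sum(p * math.log(q_ij) for p, q_ij in zip(p_i, q[i]))
    loss_weighted += x_i * cross_entropy

print(loss_counts, loss_weighted)  # the two values agree up to rounding
```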


Though commonly used for measuring the distance between probability distributions,
the cross-entropy loss function may not be a good choice here. On the one hand, as we
mentioned in Section 15.2, properly normalizing $q_{ij}$ requires a sum over
the entire vocabulary, which can be computationally expensive. On the other hand, a large
number of rare events from a large corpus are often assigned too much weight by the
cross-entropy loss.


15.5.2 The GloVe Model 


In view of this, the GloVe model makes three changes to the skip-gram model based on
squared loss (Pennington et al., 2014):


1. Use variables $p'_{ij} = x_{ij}$ and $q'_{ij} = \exp(\mathbf{u}_j^\top \mathbf{v}_i)$ that
are not probability distributions and take the logarithm of both, so the squared loss term is
$\left(\log\,p'_{ij} - \log\,q'_{ij}\right)^2 = \left(\mathbf{u}_j^\top \mathbf{v}_i - \log\,x_{ij}\right)^2$.


2. Add two scalar model parameters for each word $w_i$: the center word bias $b_i$ and the
context word bias $c_i$.


3. Replace the weight of each loss term with the weight function $h(x_{ij})$, where $h(x)$ is
increasing in the interval of $[0, 1]$.


Putting all things together, training GloVe is to minimize the following loss function:

$$\sum_{i\in\mathcal{V}} \sum_{j\in\mathcal{V}} h(x_{ij}) \left(\mathbf{u}_j^\top \mathbf{v}_i + b_i + c_j - \log\,x_{ij}\right)^2. \tag{15.5.4}$$

For the weight function, a suggested choice is: $h(x) = (x/c)^\alpha$ (e.g., $\alpha = 0.75$)
if $x < c$ (e.g., $c = 100$); otherwise $h(x) = 1$. In this case, because $h(0) = 0$, the
squared loss term for any $x_{ij} = 0$ can be omitted for computational efficiency. For
example, when using minibatch stochastic gradient descent for training, at each iteration we
randomly sample a minibatch of non-zero $x_{ij}$ to calculate gradients and update the model
parameters. Note that these non-zero $x_{ij}$ are precomputed global corpus statistics; thus,
the model is called GloVe for Global Vectors.
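
The suggested weight function and a single term of the loss can be sketched as follows. The
function names h and glove_loss_term, and the example vectors, are illustrative assumptions;
the defaults c=100 and alpha=0.75 are the example values from the text.

```python
import math

def h(x, c=100, alpha=0.75):
    """Suggested weight function: (x / c) ** alpha below c, otherwise 1."""
    return (x / c) ** alpha if x < c else 1.0

def glove_loss_term(u_j, v_i, b_i, c_j, x_ij):
    """One weighted squared-error term of the loss for a non-zero x_ij."""
    dot = sum(a * b for a, b in zip(u_j, v_i))
    return h(x_ij) * (dot + b_i + c_j - math.log(x_ij)) ** 2

print(h(0), h(10), h(100), h(1000))

# Hypothetical vectors whose dot product already equals log(50): the term is 0
perfect_fit = glove_loss_term([1.0, 0.0], [math.log(50), 0.0], 0.0, 0.0, 50)
print(perfect_fit)
```

Since h(0) is 0, pairs that never co-occur contribute nothing, which is why only non-zero
counts need to be sampled during training.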


It should be emphasized that if word $w_i$ appears in the context window of word $w_j$, then
vice versa. Therefore, $x_{ij} = x_{ji}$. Unlike word2vec that fits the asymmetric conditional
probability $p_{ij}$, GloVe fits the symmetric $\log\,x_{ij}$. Therefore, the center word
vector and the context word vector of any word are mathematically equivalent in the GloVe
model. However in practice, owing to different initialization values, the same word may still
get different values in these two vectors after training: GloVe sums them up as the output
vector.


15.5.3 Interpreting GloVe from the Ratio of Co-occurrence Probabilities


We can also interpret the GloVe model from another perspective. Using the same notation
in Section 15.5.1, let $p_{ij} = P(w_j \mid w_i)$ be the conditional probability of generating
the context word $w_j$ given $w_i$ as the center word in the corpus. Table 15.5.1 lists
several co-occurrence probabilities given words “ice” and “steam” and their ratios based on
statistics from a large corpus.


Table 15.5.1: Word-word co-occurrence probabilities and their ratios from a large corpus
(adapted from Table 1 in Pennington et al. (2014))

w_k =                  | solid    | gas      | water  | fashion
p_1 = P(w_k | ice)     | 0.00019  | 0.000066 | 0.003  | 0.000017
p_2 = P(w_k | steam)   | 0.000022 | 0.00078  | 0.0022 | 0.000018
p_1 / p_2              | 8.9      | 0.085    | 1.36   | 0.96

We can observe the following from Table 15.5.1:


• For a word $w_k$ that is related to “ice” but unrelated to “steam”, such as
$w_k = \textrm{solid}$, we expect a larger ratio of co-occurrence probabilities, such as 8.9.


• For a word $w_k$ that is related to “steam” but unrelated to “ice”, such as
$w_k = \textrm{gas}$, we expect a smaller ratio of co-occurrence probabilities, such as 0.085.


• For a word $w_k$ that is related to both “ice” and “steam”, such as
$w_k = \textrm{water}$, we expect a ratio of co-occurrence probabilities that is close to 1,
such as 1.36.


• For a word $w_k$ that is unrelated to both “ice” and “steam”, such as
$w_k = \textrm{fashion}$, we expect a ratio of co-occurrence probabilities that is close to 1,
such as 0.96.


It can be seen that the ratio of co-occurrence probabilities can intuitively express the
relationship between words. Thus, we can design a function of three word vectors to fit this
ratio. For the ratio of co-occurrence probabilities $p_{ij}/p_{ik}$ with $w_i$ being the
center word and $w_j$ and $w_k$ being the context words, we want to fit this ratio using some
function $f$:

$$f(\mathbf{u}_j, \mathbf{u}_k, \mathbf{v}_i) \approx \frac{p_{ij}}{p_{ik}}. \tag{15.5.5}$$

Among many possible designs for $f$, we only pick a reasonable choice in the following.
Since the ratio of co-occurrence probabilities is a scalar, we require that $f$ be a scalar
function, such as $f(\mathbf{u}_j, \mathbf{u}_k, \mathbf{v}_i) = f\left((\mathbf{u}_j - \mathbf{u}_k)^\top \mathbf{v}_i\right)$.
Switching word indices $j$ and $k$ in (15.5.5), it must hold that $f(x)f(-x) = 1$, so one
possibility is $f(x) = \exp(x)$, i.e.,

$$f(\mathbf{u}_j, \mathbf{u}_k, \mathbf{v}_i) = \frac{\exp\left(\mathbf{u}_j^\top \mathbf{v}_i\right)}{\exp\left(\mathbf{u}_k^\top \mathbf{v}_i\right)} \approx \frac{p_{ij}}{p_{ik}}. \tag{15.5.6}$$

Now let’s pick $\exp\left(\mathbf{u}_j^\top \mathbf{v}_i\right) \approx \alpha p_{ij}$, where
$\alpha$ is a constant. Since $p_{ij} = x_{ij}/x_i$, after taking the logarithm on both sides
we get $\mathbf{u}_j^\top \mathbf{v}_i \approx \log\,\alpha + \log\,x_{ij} - \log\,x_i$. We may
use additional bias terms to fit $-\log\,\alpha + \log\,x_i$, such as the center word bias
$b_i$ and the context word bias $c_j$:

$$\mathbf{u}_j^\top \mathbf{v}_i + b_i + c_j \approx \log\,x_{ij}. \tag{15.5.7}$$

Measuring the squared error of (15.5.7) with weights, the GloVe loss function in (15.5.4)
is obtained.


15.5.4 Summary


• The skip-gram model can be interpreted using global corpus statistics such as word-word
co-occurrence counts.


• The cross-entropy loss may not be a good choice for measuring the difference of two
probability distributions, especially for a large corpus. GloVe uses squared loss to fit
precomputed global corpus statistics.


• The center word vector and the context word vector are mathematically equivalent for
any word in GloVe.


• GloVe can be interpreted from the ratio of word-word co-occurrence probabilities.


15.5.5 Exercises 


1. If words $w_i$ and $w_j$ co-occur in the same context window, how can we use their distance
in the text sequence to redesign the method for calculating the conditional probability
$p_{ij}$? Hint: see Section 4.2 of the GloVe paper (Pennington et al., 2014).


2. For any word, are its center word bias and context word bias mathematically equivalent
in GloVe? Why?


Discussions


15.6 Subword Embedding 


In English, words such as “helps”, “helped”, and “helping” are inflected forms of the same 
word “help”. The relationship between “dog” and “dogs” is the same as that between “cat” 
and “cats”, and the relationship between “boy” and “boyfriend” is the same as that between 
“girl” and “girlfriend”. In other languages such as French and Spanish, many verbs have 
over 40 inflected forms, while in Finnish, a noun may have up to 15 cases. In linguistics, 
morphology studies word formation and word relationships. However, the internal structure
of words was explored neither in word2vec nor in GloVe.


15.6.1 The fastText Model 


Recall how words are represented in word2vec. In both the skip-gram model and the
continuous bag-of-words model, different inflected forms of the same word are directly
represented by different vectors without shared parameters. To use morphological information,
the fastText model proposed a subword embedding approach, where a subword is a character
n-gram (Bojanowski et al., 2017). Instead of learning word-level vector representations,
fastText can be considered as the subword-level skip-gram, where each center word is
represented by the sum of its subword vectors.


Let’s illustrate how to obtain subwords for each center word in fastText using the word
“where”. First, add special characters “<” and “>” at the beginning and end of the word
to distinguish prefixes and suffixes from other subwords. Then, extract character n-grams
from the word. For example, when n = 3, we obtain all subwords of length 3: “<wh”,
“whe”, “her”, “ere”, “re>”, and the special subword “<where>”.
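
The extraction step for this example can be sketched in a few lines. The helper name
char_ngrams is hypothetical and the sketch is not the fastText implementation; it only
reproduces the subwords of length 3 listed above.

```python
def char_ngrams(word, n):
    """Character n-grams of `word` after adding the boundary markers."""
    marked = '<' + word + '>'
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

trigrams = char_ngrams('where', 3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```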


In fastText, for any word $w$, denote by $\mathcal{G}_w$ the union of all its subwords of
length between 3 and 6 and its special subword. The vocabulary is the union of the subwords
of all words. Letting $\mathbf{z}_g$ be the vector of subword $g$ in the dictionary, the
vector $\mathbf{v}_w$ for word $w$ as a center word in the skip-gram model is the sum of its
subword vectors:

$$\mathbf{v}_w = \sum_{g\in\mathcal{G}_w} \mathbf{z}_g. \tag{15.6.1}$$

The rest of fastText is the same as the skip-gram model. Compared with the skip-gram 
model, the vocabulary in fastText is larger, resulting in more model parameters. Besides, 
to calculate the representation of a word, all its subword vectors have to be summed, leading 
to higher computational complexity. However, thanks to shared parameters from subwords 
among words with similar structures, rare words and even out-of-vocabulary words may 
obtain better vector representations in fastText. 


15.6.2 Byte Pair Encoding 


In fastText, all the extracted subwords have to be of the specified lengths, such as 3 to 6, 
thus the vocabulary size cannot be predefined. To allow for variable-length subwords in 
a fixed-size vocabulary, we can apply a compression algorithm called byte pair encoding 
(BPE) to extract subwords (Sennrich et al., 2015). 


Byte pair encoding performs a statistical analysis of the training dataset to discover
common symbols within a word, such as consecutive characters of arbitrary length. Starting
from symbols of length 1, byte pair encoding iteratively merges the most frequent pair of
consecutive symbols to produce new longer symbols. Note that for efficiency, pairs
crossing word boundaries are not considered. In the end, we can use such symbols as subwords
to segment words. Byte pair encoding and its variants have been used for input
representations in popular natural language processing pretraining models such as GPT-2
(Radford et al., 2019) and RoBERTa (Liu et al., 2019). In the following, we will illustrate
how byte pair encoding works.


First, we initialize the vocabulary of symbols as all the English lowercase characters, a
special end-of-word symbol '_', and a special unknown symbol '[UNK]'.


import collections

symbols = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
           'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
           '_', '[UNK]']


Since we do not consider symbol pairs that cross boundaries of words, we only need a
dictionary raw_token_freqs that maps words to their frequencies (number of occurrences)
in a dataset. Note that the special symbol '_' is appended to each word so that we can
easily recover a word sequence (e.g., “a taller man”) from a sequence of output symbols
(e.g., “a_ tall er_ man”). Since we start the merging process from a vocabulary of only
single characters and special symbols, space is inserted between every pair of consecutive
characters within each word (keys of the dictionary token_freqs). In other words, space
is the delimiter between symbols within a word.
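
To illustrate why the end-of-word symbol makes recovery easy, the following sketch turns a
space-delimited symbol sequence back into words. The helper name symbols_to_words is
hypothetical.

```python
def symbols_to_words(symbol_sequence):
    """Join the symbols, then split at the end-of-word marker '_'."""
    joined = ''.join(symbol_sequence.split())
    return ' '.join(word for word in joined.split('_') if word)

recovered = symbols_to_words('a_ tall er_ man')
print(recovered)  # a taller man
```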


raw_token_freqs = {'fast_': 4, 'faster_': 3, 'tall_': 5, 'taller_': 4}
token_freqs = {}
for token, freq in raw_token_freqs.items():
    token_freqs[' '.join(list(token))] = raw_token_freqs[token]
token_freqs


{'f a s t _': 4, 'f a s t e r _': 3, 't a l l _': 5, 't a l l e r _': 4}


We define the following get_max_freq_pair function that returns the most frequent pair
of consecutive symbols within a word, where words come from keys of the input dictionary
token_freqs.


def get_max_freq_pair(token_freqs):
    pairs = collections.defaultdict(int)
    for token, freq in token_freqs.items():
        symbols = token.split()
        for i in range(len(symbols) - 1):
            # Key of `pairs` is a tuple of two consecutive symbols
            pairs[symbols[i], symbols[i + 1]] += freq
    return max(pairs, key=pairs.get)  # Key of `pairs` with the max value


As a greedy approach based on frequency of consecutive symbols, byte pair encoding will 
use the following merge_symbols function to merge the most frequent pair of consecutive 
symbols to produce new symbols. 


def merge_symbols(max_freq_pair, token_freqs, symbols):
    symbols.append(''.join(max_freq_pair))
    new_token_freqs = dict()
    for token, freq in token_freqs.items():
        new_token = token.replace(' '.join(max_freq_pair),
                                  ''.join(max_freq_pair))
        new_token_freqs[new_token] = token_freqs[token]
    return new_token_freqs


Now we iteratively perform the byte pair encoding algorithm over the keys of the dictionary
token_freqs. In the first iteration, the most frequent pair of consecutive symbols is 't'
and 'a', thus byte pair encoding merges them to produce a new symbol 'ta'. In the
second iteration, byte pair encoding continues to merge 'ta' and 'l' to result in another
new symbol 'tal'.
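
To make the first iteration concrete, the following sketch recounts the weighted pair frequencies for the toy dataset. The helper count_pairs is a hypothetical stand-in written with collections.Counter; it is meant to mirror the counting step of get_max_freq_pair, not to replace it.

```python
import collections

# Space-delimited symbols per word, as built earlier in this section
token_freqs = {'f a s t _': 4, 'f a s t e r _': 3,
               't a l l _': 5, 't a l l e r _': 4}


def count_pairs(token_freqs):
    """Count consecutive symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for token, freq in token_freqs.items():
        symbols = token.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[left, right] += freq
    return pairs


pair_counts = count_pairs(token_freqs)
best = max(pair_counts.values())
# All pairs that share the highest count, in order of first appearance
print(best, [pair for pair, n in pair_counts.items() if n == best])
```

Three pairs tie at a count of 9: ('t', 'a'), ('a', 'l'), and ('l', 'l'). Since max returns the first maximal key in the dictionary's insertion order, ('t', 'a') is merged first.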


num_merges = 10
for i in range(num_merges):
    max_freq_pair = get_max_freq_pair(token_freqs)
    token_freqs = merge_symbols(max_freq_pair, token_freqs, symbols)
    print(f'merge #{i + 1}:', max_freq_pair)


merge #1: ('t', 'a')
merge #2: ('ta', 'l')
merge #3: ('tal', 'l')
merge #4: ('f', 'a')
merge #5: ('fa', 's')
merge #6: ('fas', 't')
merge #7: ('e', 'r')
merge #8: ('er', '_')
merge #9: ('tall', '_')
merge #10: ('fast', '_')

After 10 iterations of byte pair encoding, we can see that list symbols now contains 10 
more symbols that are iteratively merged from other symbols. 


print(symbols) 


['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '_', '[UNK]', 'ta', 'tal',
 'tall', 'fa', 'fas', 'fast', 'er', 'er_', 'tall_', 'fast_']


For the same dataset specified in the keys of the dictionary raw_token_freqs, each word
in the dataset is now segmented by subwords “fast_”, “fast”, “er_”, “tall_”, and “tall” as a
result of the byte pair encoding algorithm. For instance, words “faster_” and “taller_” are
segmented as “fast er_” and “tall er_”, respectively.


print(list(token_freqs.keys()))


['fast_', 'fast er_', 'tall_', 'tall er_']


Note that the result of byte pair encoding depends on the dataset being used. We can also 
use the subwords learned from one dataset to segment words of another dataset. As a 
greedy approach, the following segment_BPE function tries to break words into the longest 
possible subwords from the input argument symbols. 


def segment_BPE(tokens, symbols):
    outputs = []
    for token in tokens:
        start, end = 0, len(token)
        cur_output = []
        # Segment token with the longest possible subwords from symbols
        while start < len(token) and start < end:
            if token[start: end] in symbols:
                cur_output.append(token[start: end])
                start = end
                end = len(token)
            else:
                end -= 1
        if start < len(token):
            cur_output.append('[UNK]')
        outputs.append(' '.join(cur_output))
    return outputs


In the following, we use the subwords in list symbols, which are learned from the aforementioned
dataset, to segment tokens that represent another dataset.


tokens = ['tallest_', 'fatter_'] 
print(segment_BPE(tokens, symbols)) 


['tall e s t _', 'fa t t er_']


15.6.3 Summary 


• The fastText model proposes a subword embedding approach. Based on the skip-gram
  model in word2vec, it represents a center word as the sum of its subword vectors.


• Byte pair encoding performs a statistical analysis of the training dataset to discover common
  symbols within a word. As a greedy approach, byte pair encoding iteratively
  merges the most frequent pair of consecutive symbols.


• Subword embedding may improve the quality of representations of rare words and
  out-of-dictionary words.


15.6.4 Exercises 


1. As an example, there are about 3 × 10^8 possible 6-grams in English. What is the issue
when there are too many subwords? How to address the issue? Hint: refer to the end of 
Section 3.2 of the fastText paper (Bojanowski et al., 2017). 


2. How to design a subword embedding model based on the continuous bag-of-words 
model? 


3. To get a vocabulary of size m, how many merging operations are needed when the initial 
symbol vocabulary size is n? 


4. How to extend the idea of byte pair encoding to extract phrases? 


Discussions.


15.7 Word Similarity and Analogy


In Section 15.4, we trained a word2vec model on a small dataset, and applied it to find 
semantically similar words for an input word. In practice, word vectors that are pretrained 
on large corpora can be applied to downstream natural language processing tasks, which 
will be covered later in Chapter 16. To demonstrate semantics of pretrained word vectors 
from large corpora in a straightforward way, let’s apply them in the word similarity and 
analogy tasks. 


import os
import torch
from torch import nn
from d2l import torch as d2l


15.7.1 Loading Pretrained Word Vectors 


Listed below are pretrained GloVe embeddings of dimension 50, 100, and 300, which can be
downloaded from the GloVe website. The pretrained fastText embeddings are available
in multiple languages. Here we consider one English version (300-dimensional “wiki.en”)
that can be downloaded from the fastText website.


#@save
d2l.DATA_HUB['glove.6b.50d'] = (d2l.DATA_URL + 'glove.6B.50d.zip',
                                '0b8703943ccdb6eb788e6f091b8946e82231bc4d')

#@save
d2l.DATA_HUB['glove.6b.100d'] = (d2l.DATA_URL + 'glove.6B.100d.zip',
                                 'cd43bfb07e44e6f27cbcc7bc9ae3d80284fdaf5a')

#@save
d2l.DATA_HUB['glove.42b.300d'] = (d2l.DATA_URL + 'glove.42B.300d.zip',
                                  'b5116e234e9eb9076672cfeabf5469f3eec904fa')

#@save
d2l.DATA_HUB['wiki.en'] = (d2l.DATA_URL + 'wiki.en.zip',
                           'c1816da3821ae9f43899be655002f6c723e91b88')


To load these pretrained GloVe and fastText embeddings, we define the following
TokenEmbedding class.


#@save
class TokenEmbedding:
    """Token Embedding."""
    def __init__(self, embedding_name):
        self.idx_to_token, self.idx_to_vec = self._load_embedding(
            embedding_name)
        self.unknown_idx = 0
        self.token_to_idx = {token: idx for idx, token in
                             enumerate(self.idx_to_token)}

    def _load_embedding(self, embedding_name):
        idx_to_token, idx_to_vec = ['<unk>'], []
        data_dir = d2l.download_extract(embedding_name)
        # GloVe website: https://nlp.stanford.edu/projects/glove/
        # fastText website: https://fasttext.cc/
        with open(os.path.join(data_dir, 'vec.txt'), 'r') as f:
            for line in f:
                elems = line.rstrip().split(' ')
                token, elems = elems[0], [float(elem) for elem in elems[1:]]
                # Skip header information, such as the top row in fastText
                if len(elems) > 1:
                    idx_to_token.append(token)
                    idx_to_vec.append(elems)
        idx_to_vec = [[0] * len(idx_to_vec[0])] + idx_to_vec
        return idx_to_token, torch.tensor(idx_to_vec)

    def __getitem__(self, tokens):
        indices = [self.token_to_idx.get(token, self.unknown_idx)
                   for token in tokens]
        vecs = self.idx_to_vec[torch.tensor(indices)]
        return vecs

    def __len__(self):
        return len(self.idx_to_token)
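
To see what the parsing loop in _load_embedding does without downloading anything, the sketch below runs the same steps on a tiny in-memory file. The two 3-dimensional vectors and the header row are made up for illustration, and plain Python lists stand in for the tensor.

```python
import io

# Two made-up 3-dimensional vectors in the "token v1 v2 ..." text layout
# (hypothetical data, not taken from any real embedding file)
fake_file = io.StringIO('2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n')

idx_to_token, idx_to_vec = ['<unk>'], []
for line in fake_file:
    elems = line.rstrip().split(' ')
    token, values = elems[0], [float(v) for v in elems[1:]]
    if len(values) > 1:  # Skip a fastText-style header row such as "2 3"
        idx_to_token.append(token)
        idx_to_vec.append(values)
# Index 0 is reserved for unknown tokens and mapped to the zero vector
idx_to_vec = [[0.0] * len(idx_to_vec[0])] + idx_to_vec
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}

print(idx_to_vec[token_to_idx.get('world', 0)])    # [0.4, 0.5, 0.6]
print(idx_to_vec[token_to_idx.get('missing', 0)])  # [0.0, 0.0, 0.0]
```

Any token missing from the vocabulary falls back to index 0, which is why the vocabulary is one entry larger than the number of vectors in the file.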


Below we load the 50-dimensional GloVe embeddings (pretrained on a Wikipedia subset).
When creating the TokenEmbedding instance, the specified embedding file is
downloaded if it has not been downloaded already.


glove_6b50d = TokenEmbedding('glove.6b.50d')


Downloading ../data/glove.6B.50d.zip from http://d2l-data.s3-accelerate.amazonaws.com/glove.6B.50d.zip...


Output the vocabulary size. The vocabulary contains 400000 words (tokens) and a special 
unknown token. 


len(glove_6b50d) 


400001 


We can get the index of a word in the vocabulary, and vice versa. 


glove_6b50d.token_to_idx['beautiful'], glove_6b50d.idx_to_token[3367]


(3367, 'beautiful')


15.7.2 Applying Pretrained Word Vectors 


Using the loaded GloVe vectors, we will demonstrate their semantics by applying them in 
the following word similarity and analogy tasks. 


Word Similarity 


Similar to Section 15.4.3, in order to find semantically similar words for an input word 
based on cosine similarities between word vectors, we implement the following knn (k- 
nearest neighbors) function. 


def knn(W, x, k):
    # Add 1e-9 for numerical stability
    cos = torch.mv(W, x.reshape(-1,)) / (
        torch.sqrt(torch.sum(W * W, axis=1) + 1e-9) *
        torch.sqrt((x * x).sum()))
    _, topk = torch.topk(cos, k=k)
    return topk, [cos[int(i)] for i in topk]
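
The same ranking can be checked by hand on a few low-dimensional vectors. The sketch below is a plain-Python analogue of knn, not the tensor implementation above; the 2-dimensional vectors in toy_vecs are invented for illustration and carry no real semantics.

```python
import math


def cosine(u, v, eps=1e-9):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u) + eps)  # eps as in `knn`
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-dimensional "embeddings", chosen by hand for illustration
toy_vecs = {'chip': [1.0, 0.1], 'chips': [0.9, 0.2],
            'intel': [0.8, 0.5], 'banana': [-0.2, 1.0]}

query = toy_vecs['chip']
ranked = sorted(toy_vecs, key=lambda w: cosine(toy_vecs[w], query),
                reverse=True)
print(ranked[:3])  # ['chip', 'chips', 'intel']
```

The query word itself always ranks first with a similarity of about 1, so it is the first entry a caller will usually want to drop.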


Then, we search for similar words using the pretrained word vectors from the
TokenEmbedding instance embed.


def get_similar_tokens(query_token, k, embed):
    topk, cos = knn(embed.idx_to_vec, embed[[query_token]], k + 1)
    for i, c in zip(topk[1:], cos[1:]):  # Exclude the input word
        print(f'cosine sim={float(c):.3f}: {embed.idx_to_token[int(i)]}')


The vocabulary of the pretrained word vectors in glove_6b50d contains 400000 words
and a special unknown token. Excluding the input word and the unknown token, let’s find
the three words in this vocabulary that are most semantically similar to the word “chip”.


get_similar_tokens('chip', 3, glove_6b50d)


cosine sim=0.856: chips 
cosine sim=0.749: intel 
cosine sim=0.749: electronics 


Below we output words similar to “baby” and “beautiful”.


get_similar_tokens('baby', 3, glove_6b50d)


cosine sim=0.839: babies
cosine sim=0.800: boy
cosine sim=0.792: girl


get_similar_tokens('beautiful', 3, glove_6b50d)


cosine sim=0.921: lovely 
cosine sim=0.893: gorgeous 
cosine sim=0.830: wonderful 


Word Analogy 


Besides finding similar words, we can also apply word vectors to word analogy tasks. For
example, “man”:“woman”::“son”:“daughter” is the form of a word analogy: “man” is to
“woman” as “son” is to “daughter”. Specifically, the word analogy completion task can be
defined as: for a word analogy a : b :: c : d, given the first three words a, b and c, find d.
Denote the vector of word w by vec(w). To complete the analogy, we will find the word
whose vector is most similar to the result of vec(c) + vec(b) - vec(a).


def get_analogy(token_a, token_b, token_c, embed):
    vecs = embed[[token_a, token_b, token_c]]
    x = vecs[1] - vecs[0] + vecs[2]
    topk, cos = knn(embed.idx_to_vec, x, 1)
    return embed.idx_to_token[int(topk[0])]  # Remove unknown words
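
The arithmetic behind the analogy can be traced on hand-made vectors. In the sketch below, toy_vecs and toy_analogy are hypothetical: the two coordinates are invented so that one axis loosely stands for generation and the other for gender, which is far simpler than what real embeddings encode.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


# Hypothetical vectors: first axis ~ "generation", second axis ~ "gender"
toy_vecs = {'man': [1.0, 0.0], 'woman': [1.0, 1.0],
            'son': [2.0, 0.0], 'daughter': [2.0, 1.0],
            'king': [3.0, 0.0]}


def toy_analogy(a, b, c, vecs):
    # vec(c) + vec(b) - vec(a), computed coordinate by coordinate
    x = [vc + vb - va for va, vb, vc in zip(vecs[a], vecs[b], vecs[c])]
    # Return the word whose vector has the highest cosine similarity to x
    return max(vecs, key=lambda w: cosine(vecs[w], x))


print(toy_analogy('man', 'woman', 'son', toy_vecs))  # daughter
```

Here vec(son) + vec(woman) - vec(man) equals [2.0, 1.0], which coincides with the toy vector for “daughter”; with real embeddings the match is only approximate.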


Let’s verify the “male-female” analogy using the loaded word vectors. 


get_analogy('man', 'woman', 'son', glove_6b50d)


'daughter'


Below we complete a “capital-country” analogy: “beijing”:“china”::“tokyo”:“japan”. This
demonstrates semantics in the pretrained word vectors.


get_analogy('beijing', 'china', 'tokyo', glove_6b50d)


'japan'


For the “adjective-superlative adjective” analogy such as “bad”:“worst”::“big”:“biggest”,
we can see that the pretrained word vectors may capture the syntactic information.


get_analogy('bad', 'worst', 'big', glove_6b50d)


'biggest'


To show the captured notion of past tense in the pretrained word vectors, we can test the
syntax using the “present tense-past tense” analogy: “do”:“did”::“go”:“went”.


get_analogy('do', 'did', 'go', glove_6b50d)


'went'


15.7.3 Summary 


• In practice, word vectors that are pretrained on large corpora can be applied to downstream
  natural language processing tasks.


• Pretrained word vectors can be applied to the word similarity and analogy tasks.


15.7.4 Exercises 


1. Test the fastText results using TokenEmbedding('wiki.en').


2. When the vocabulary is extremely large, how can we find similar words or complete a 
word analogy faster? 


Discussions.


15.8 Bidirectional Encoder Representations from 
Transformers (BERT) 


We have introduced several word embedding models for natural language understanding. 
After pretraining, the output can be thought of as a matrix where each row is a vector that 
represents a word of a predefined vocabulary. In fact, these word embedding models are 
all context-independent. Let’s begin by illustrating this property. 


15.8.1 From Context-Independent to Context-Sensitive 


Recall the experiments in Section 15.4 and Section 15.7. For instance, word2vec and GloVe 
both assign the same pretrained vector to the same word regardless of the context of the 
word (if any). Formally, a context-independent representation of any token x is a func- 
tion f(x) that only takes x as its input. Given the abundance of polysemy and complex 
semantics in natural languages, context-independent representations have obvious limita- 
tions. For instance, the word “crane” in contexts “a crane is flying” and “a crane driver 
came” has completely different meanings; thus, the same word may be assigned different 
representations depending on contexts. 


This motivates the development of context-sensitive word representations, where representations
of words depend on their contexts. Hence, a context-sensitive representation of
token x is a function f(x, c(x)) depending on both x and its context c(x). Popular
context-sensitive representations include TagLM (language-model-augmented sequence tagger)
(Peters et al., 2017), CoVe (Context Vectors) (McCann et al., 2017), and ELMo (Embeddings
from Language Models) (Peters et al., 2018).
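
The difference between f(x) and f(x, c(x)) can be illustrated with a toy example. In the sketch below, the lookup table and both functions are hypothetical: f_sensitive simply mixes a token's own vector with the mean of the other tokens in the sentence, which is a stand-in for context sensitivity and not how TagLM, CoVe, or ELMo compute their representations.

```python
# Hypothetical 2-dimensional vectors, chosen by hand for illustration
table = {'a': [0.0, 0.0], 'crane': [1.0, 1.0], 'is': [0.0, 0.0],
         'flying': [0.0, 2.0], 'driver': [2.0, 0.0], 'came': [0.0, 0.0]}


def f_independent(x):
    """Context-independent: the output depends on the token x alone."""
    return table[x]


def f_sensitive(tokens, i):
    """Toy f(x, c(x)): mix token i's vector with the mean of its context."""
    own = table[tokens[i]]
    context = [table[t] for j, t in enumerate(tokens) if j != i]
    return [0.5 * own[d] + 0.5 * sum(v[d] for v in context) / len(context)
            for d in range(len(own))]


s1 = 'a crane is flying'.split()
s2 = 'a crane driver came'.split()
print(f_independent(s1[1]) == f_independent(s2[1]))  # True
print(f_sensitive(s1, 1) == f_sensitive(s2, 1))      # False
```

The lookup returns the same vector for “crane” in both sentences, whereas the toy context-sensitive function yields two different vectors.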


For example, by taking the entire sequence as input, ELMo is a function that assigns a rep- 
resentation to each word from the input sequence. Specifically, ELMo combines all the 
intermediate layer representations from pretrained bidirectional LSTM as the output rep- 
resentation. Then the ELMo representation will be added to a downstream task’s existing 
supervised model as additional features, such as by concatenating ELMo representation 
and the original representation (e.g., GloVe) of tokens in the existing model. On the one 
hand, all the weights in the pretrained bidirectional LSTM model are frozen after ELMo 
representations are added. On the other hand, the existing supervised model is specifically 
customized for a given task. Leveraging different best models for different tasks at that 
time, adding ELMo improved the state of the art across six natural language processing 
tasks: sentiment analysis, natural language inference, semantic role labeling, coreference 
resolution, named entity recognition, and question answering. 


15.8.2 From Task-Specific to Task-Agnostic


Although ELMo has significantly improved solutions to a diverse set of natural language 
processing tasks, each solution still hinges on a task-specific architecture. However, it is 
practically non-trivial to craft a specific architecture for every natural language processing 
task. The GPT (Generative Pre-Training) model represents an effort in designing a general 
task-agnostic model for context-sensitive representations (Radford et al., 2018). Built on 
a Transformer decoder, GPT pretrains a language model that will be used to represent text 
sequences. When applying GPT to a downstream task, the output of the language model 
will be fed into an added linear output layer to predict the label of the task. In sharp contrast 
to ELMo that freezes parameters of the pretrained model, GPT fine-tunes all the parame- 
ters in the pretrained Transformer decoder during supervised learning of the downstream 
task. GPT was evaluated on twelve tasks of natural language inference, question answer- 
ing, sentence similarity, and classification, and improved the state of the art in nine of them 
with minimal changes to the model architecture. 


However, due to the autoregressive nature of language models, GPT only looks forward 
(left-to-right). In contexts “i went to the bank to deposit cash” and “i went to the bank 
to sit down”, as “bank” is sensitive to the context to its left, GPT will return the same 
representation for “bank”, though it has different meanings. 
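
A quick check of the two example sentences shows why a left-to-right model cannot tell the two senses apart. The sketch only compares token lists and involves no model.

```python
s1 = 'i went to the bank to deposit cash'.split()
s2 = 'i went to the bank to sit down'.split()

i = s1.index('bank')  # "bank" sits at the same position in both sentences
left_1, right_1 = s1[:i], s1[i + 1:]
left_2, right_2 = s2[:i], s2[i + 1:]

# A left-to-right model conditions only on the left context ...
print(left_1 == left_2)    # True
# ... while the disambiguating words appear to the right
print(right_1 == right_2)  # False
```

Because the left contexts are identical, any function of the left context alone must assign “bank” the same representation in both sentences.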


15.8.3 BERT: Combining the Best of Both Worlds 


As we have seen, ELMo encodes context bidirectionally but uses task-specific architectures; 
while GPT is task-agnostic but encodes context left-to-right. Combining the best of both 
worlds, BERT (Bidirectional Encoder Representations from Transformers) encodes con- 
text bidirectionally and requires minimal architecture changes for a wide range of natural 
language processing tasks (Devlin et al., 2018). Using a pretrained Transformer encoder, 
BERT is able to represent any token based on its bidirectional context. During supervised
learning of downstream tasks, BERT is similar to GPT in two aspects. First, BERT representations
will be fed into an added output layer, with minimal changes to the model
architecture depending on the nature of tasks, such as predicting for every token vs. predicting
for the entire sequence. Second, all the parameters of the pretrained Transformer encoder 
are fine-tuned, while the additional output layer will be trained from scratch. Fig. 15.8.1 
depicts the differences among ELMo, GPT, and BERT. 


Fig. 15.8.1 A comparison of ELMo, GPT, and BERT. (Three panels, each mapping input
tokens to representations and then to the label(s) of the task. Diagram labels: architecture
crafted for the given task; unidirectional architecture; pretraining; pretraining & fine-tuning.)


BERT further improved the state of the art on eleven natural language processing tasks
under broad categories of (i) single text classification (e.g., sentiment analysis), (ii) text pair
classification (e.g., natural language inference), (iii) question answering, and (iv) text tagging
(e.g., named entity recognition). All proposed in 2018, from context-sensitive ELMo to
task-agnostic GPT and BERT, conceptually simple yet empirically powerful pretraining of
deep representations for natural languages has revolutionized solutions to various natural
language processing tasks.


In the rest of this chapter, we will dive into the pretraining of BERT. When natural language 
processing applications are explained in Chapter 16, we will illustrate fine-tuning of BERT 
for downstream applications. 


import torch
from torch import nn
from d2l import torch as d2l


15.8.4 Input Representation 


In natural language processing, some tasks (e.g., sentiment analysis) take a single text as
input, while in some other tasks (e.g., natural language inference), the input is a pair of
text sequences. The BERT input sequence unambiguously represents both single text and
text pairs. In the former, the BERT input sequence is the concatenation of the special
classification token “<cls>”, tokens of a text sequence, and the special separation token
“<sep>”. In the latter, the BERT input sequence is the concatenation of “<cls>”, tokens
of the first text sequence, “<sep>”, tokens of the second text sequence, and “<sep>”. We
will consistently distinguish the terminology “BERT input sequence” from other types of
“sequences”. For instance, one BERT input sequence may include either one text sequence
or two text sequences.


To distinguish text pairs, the learned segment embeddings e_A and e_B are added to the
token embeddings of the first sequence and the second sequence, respectively. For single
text inputs, only e_A is used.


The following get_tokens_and_segments takes either one sentence or two sentences as
input, then returns tokens of the BERT input sequence and their corresponding segment
IDs.


#@save
def get_tokens_and_segments(tokens_a, tokens_b=None):
    """Get tokens of the BERT input sequence and their segment IDs."""
    tokens = ['<cls>'] + tokens_a + ['<sep>']
    # 0 and 1 are marking segment A and B, respectively
    segments = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        tokens += tokens_b + ['<sep>']
        segments += [1] * (len(tokens_b) + 1)
    return tokens, segments
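
As a usage example, the sketch below applies the function to a text pair. The function body is repeated only so that the snippet runs on its own; the two example sentences are chosen for illustration.

```python
def get_tokens_and_segments(tokens_a, tokens_b=None):
    # Same logic as the function defined above, repeated so that this
    # snippet runs on its own
    tokens = ['<cls>'] + tokens_a + ['<sep>']
    segments = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        tokens += tokens_b + ['<sep>']
        segments += [1] * (len(tokens_b) + 1)
    return tokens, segments


tokens, segments = get_tokens_and_segments(
    'this movie is great'.split(), 'i like it'.split())
print(tokens)
print(segments)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

The “<cls>” token and the first “<sep>” belong to segment 0, while the closing “<sep>” belongs to segment 1, so the two lists always have the same length.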


BERT chooses the Transformer encoder as its bidirectional architecture. Common in the 
Transformer encoder, positional embeddings are added at every position of the BERT input 
sequence. However, different from the original Transformer encoder, BERT uses learnable 
positional embeddings. To sum up, Fig. 15.8.2 shows that the embeddings of the BERT 
input sequence are the sum of the token embeddings, segment embeddings, and positional 
embeddings. 


Fig. 15.8.2 The embeddings of the BERT input sequence are the sum of the token embeddings,
segment embeddings, and positional embeddings. (Example input in the figure:
<cls> this movie is great <sep> i like it <sep>.)


The following BERTEncoder class is similar to the TransformerEncoder class as imple- 
mented in Section 11.7. Different from TransformerEncoder, BERTEncoder uses segment 
embeddings and learnable positional embeddings. 


#@save
class BERTEncoder(nn.Module):
    """BERT encoder."""
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens, num_heads,
                 num_blks, dropout, max_len=1000, **kwargs):
        super(BERTEncoder, self).__init__(**kwargs)
        self.token_embedding = nn.Embedding(vocab_size, num_hiddens)
        self.segment_embedding = nn.Embedding(2, num_hiddens)
        self.blks = nn.Sequential()
        for i in range(num_blks):
            self.blks.add_module(f"{i}", d2l.TransformerEncoderBlock(
                num_hiddens, ffn_num_hiddens, num_heads, dropout, True))
        # In BERT, positional embeddings are learnable, thus we create a
        # parameter of positional embeddings that are long enough
        self.pos_embedding = nn.Parameter(torch.randn(1, max_len,
                                                      num_hiddens))

    def forward(self, tokens, segments, valid_lens):
        # Shape of `X` remains unchanged in the following code snippet:
        # (batch size, max sequence length, `num_hiddens`)
        X = self.token_embedding(tokens) + self.segment_embedding(segments)
        X = X + self.pos_embedding[:, :X.shape[1], :]
        for blk in self.blks:
            X = blk(X, valid_lens)
        return X


Suppose that the vocabulary size is 10000. To demonstrate forward inference of
BERTEncoder, let’s create an instance of it and initialize its parameters.


vocab_size, num_hiddens, ffn_num_hiddens, num_heads = 10000, 768, 1024, 4 

ffn_num_input, num_blks, dropout = 768, 2, 0.2 

encoder = BERTEncoder(vocab_size, num_hiddens, ffn_num_hiddens, num_heads, 
num_blks, dropout) 


We define tokens to be 2 BERT input sequences of length 8, where each token is an index 
of the vocabulary. The forward inference of BERTEncoder with the input tokens returns 
the encoded result where each token is represented by a vector whose length is predefined 
by the hyperparameter num_hiddens. This hyperparameter is usually referred to as the 
hidden size (number of hidden units) of the Transformer encoder. 


tokens = torch.randint(0, vocab_size, (2, 8))
segments = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 1, 1]])
encoded_X = encoder(tokens, segments, None)
encoded_X.shape


torch.Size([2, 8, 768]) 


15.8.5 Pretraining Tasks 


The forward inference of BERTEncoder gives the BERT representation of each token of
the input text and the inserted special tokens “<cls>” and “<sep>”. Next, we will use
these representations to compute the loss function for pretraining BERT. The pretraining
is composed of the following two tasks: masked language modeling and next sentence
prediction.


Masked Language Modeling


As illustrated in Section 9.3, a language model predicts a token using the context on its 
left. To encode context bidirectionally for representing each token, BERT randomly masks 
tokens and uses tokens from the bidirectional context to predict the masked tokens in a 
self-supervised fashion. This task is referred to as a masked language model. 


In this pretraining task, 15% of tokens will be selected at random as the masked tokens for 
prediction. To predict a masked token without cheating by using the label, one straight- 
forward approach is to always replace it with a special “<mask>” token in the BERT input 
sequence. However, the artificial special token “<mask>” will never appear in fine-tuning. 
To avoid such a mismatch between pretraining and fine-tuning, if a token is masked for 
prediction (e.g., “great” is selected to be masked and predicted in “this movie is great”), in 
the input it will be replaced with: 


• a special “<mask>” token for 80% of the time (e.g., “this movie is great” becomes “this
  movie is <mask>”);


• a random token for 10% of the time (e.g., “this movie is great” becomes “this movie is
  drink”);


• the unchanged label token for 10% of the time (e.g., “this movie is great” becomes “this
  movie is great”).


Note that a random token is inserted for only 10% of the 15% of tokens selected for
prediction, i.e., for 1.5% of all tokens. This occasional noise encourages
BERT to be less biased towards the masked token (especially when the label token remains
unchanged) in its bidirectional context encoding.
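
The replacement rule can be sketched in a few lines. The helper mask_tokens below is hypothetical and rests on assumptions that this section does not spell out: special tokens are never selected, the number of predictions is max(1, round(0.15 × sequence length)), and random replacements are drawn uniformly from a small made-up vocabulary.

```python
import random


def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Toy 80%/10%/10% masking; returns (masked input, [(position, label)])."""
    rng = random.Random(seed)
    # Assumption: special tokens are never selected for prediction
    candidates = [i for i, token in enumerate(tokens)
                  if token not in ('<cls>', '<sep>')]
    rng.shuffle(candidates)
    num_preds = max(1, round(len(tokens) * mask_rate))
    mlm_input, preds = list(tokens), []
    for pos in candidates[:num_preds]:
        r = rng.random()
        if r < 0.8:      # 80% of the time: the special '<mask>' token
            mlm_input[pos] = '<mask>'
        elif r < 0.9:    # 10% of the time: a random token
            mlm_input[pos] = rng.choice(vocab)
        # Remaining 10% of the time: keep the token unchanged
        preds.append((pos, tokens[pos]))  # The label is the original token
    return mlm_input, sorted(preds)


tokens = ['<cls>', 'this', 'movie', 'is', 'great', '<sep>',
          'i', 'like', 'it', '<sep>']
vocab = ['this', 'movie', 'is', 'great', 'i', 'like', 'it', 'drink']
mlm_input, preds = mask_tokens(tokens, vocab)
print(mlm_input)
print(preds)
```

Whatever replacement is drawn, the prediction target stays the original token, so the labels come for free from the corpus.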


We implement the following MaskLM class to predict masked tokens in the masked language 
model task of BERT pretraining. The prediction uses a one-hidden-layer MLP (self.mlp).
In forward inference, it takes two inputs: the encoded result of BERTEncoder and the token 
positions for prediction. The output is the prediction results at these positions. 


#@save
class MaskLM(nn.Module):
    """The masked language model task of BERT."""
    def __init__(self, vocab_size, num_hiddens, **kwargs):
        super(MaskLM, self).__init__(**kwargs)
        self.mlp = nn.Sequential(nn.LazyLinear(num_hiddens),
                                 nn.ReLU(),
                                 nn.LayerNorm(num_hiddens),
                                 nn.LazyLinear(vocab_size))

    def forward(self, X, pred_positions):
        num_pred_positions = pred_positions.shape[1]
        pred_positions = pred_positions.reshape(-1)
        batch_size = X.shape[0]
        batch_idx = torch.arange(0, batch_size)
        # Suppose that `batch_size` = 2, `num_pred_positions` = 3, then
        # `batch_idx` is `torch.tensor([0, 0, 0, 1, 1, 1])`
        batch_idx = torch.repeat_interleave(batch_idx, num_pred_positions)
        masked_X = X[batch_idx, pred_positions]
        masked_X = masked_X.reshape((batch_size, num_pred_positions, -1))
        mlm_Y_hat = self.mlp(masked_X)
        return mlm_Y_hat


To demonstrate the forward inference of MaskLM, we create its instance mlm and initialize 
it. Recall that encoded_X from the forward inference of BERTEncoder represents 2 BERT 
input sequences. We define mlm_positions as the 3 indices to predict in either BERT input 
sequence of encoded_X. The forward inference of mlm returns prediction results mlm_Y_hat 
at all the masked positions mlm_positions of encoded_X. For each prediction, the size of 
the result is equal to the vocabulary size. 


mlm = MaskLM(vocab_size, num_hiddens)
mlm_positions = torch.tensor([[1, 5, 2], [6, 1, 5]])
mlm_Y_hat = mlm(encoded_X, mlm_positions)
mlm_Y_hat.shape


torch.Size([2, 3, 10000])
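
The indexing step inside MaskLM.forward, which pairs each sequence index with its prediction positions, can be mirrored with nested Python lists. The values in X and pred_positions below are made up so that each entry reveals which (sequence, position) it came from; the list comprehensions are a plain-Python analogue of the tensor operations, not the implementation itself.

```python
# Toy "encoded" batch: 2 sequences, 4 positions each, hidden size 1
# (hypothetical values; each entry encodes its own sequence and position)
X = [[[0.0], [0.1], [0.2], [0.3]],
     [[1.0], [1.1], [1.2], [1.3]]]
pred_positions = [[1, 3, 2], [0, 2, 1]]

batch_size, num_pred_positions = len(X), len(pred_positions[0])
# Repeat each batch index once per predicted position: [0, 0, 0, 1, 1, 1]
batch_idx = [b for b in range(batch_size) for _ in range(num_pred_positions)]
flat_positions = [p for row in pred_positions for p in row]

# Pick X[b][p] for every (b, p) pair, then regroup per sequence
flat = [X[b][p] for b, p in zip(batch_idx, flat_positions)]
masked_X = [flat[i * num_pred_positions:(i + 1) * num_pred_positions]
            for i in range(batch_size)]
print(batch_idx)  # [0, 0, 0, 1, 1, 1]
print(masked_X)   # [[[0.1], [0.3], [0.2]], [[1.0], [1.2], [1.1]]]
```

The result has shape (batch size, number of prediction positions, hidden size), which is what the MLP then maps to vocabulary-sized scores.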


With the ground truth labels mlm_Y of the predicted tokens mlm_Y_hat under masks, we 
can calculate the cross-entropy loss of the masked language model task in BERT pretrain- 
ing. 


mlm_Y = torch.tensor([[7, 8, 9], [10, 20, 30]])
loss = nn.CrossEntropyLoss(reduction='none')
mlm_l = loss(mlm_Y_hat.reshape((-1, vocab_size)), mlm_Y.reshape(-1))
mlm_l.shape


torch.Size([6])


Next Sentence Prediction 


Although masked language modeling is able to encode bidirectional context for represent- 
ing words, it does not explicitly model the logical relationship between text pairs. To help 
understand the relationship between two text sequences, BERT considers a binary classi- 
fication task, next sentence prediction, in its pretraining. When generating sentence pairs 
for pretraining, for half of the time they are indeed consecutive sentences with the label 
“True”; while for the other half of the time the second sentence is randomly sampled from 
the corpus with the label “False”. 
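
The sampling scheme can be sketched as follows. The helper make_nsp_pair and the tiny two-paragraph corpus are hypothetical; the sketch assumes that the corpus is available as a list of paragraphs, each a list of sentences.

```python
import random


def make_nsp_pair(sentence, next_sentence, paragraphs, rng):
    """Return (sentence, candidate, is_next) for next sentence prediction."""
    if rng.random() < 0.5:
        is_next = True  # Keep the true next sentence
    else:
        # `paragraphs` is a list of paragraphs, each a list of sentences
        next_sentence = rng.choice(rng.choice(paragraphs))
        is_next = False
    return sentence, next_sentence, is_next


# Made-up corpus for illustration only
paragraphs = [['this movie is great', 'i like it'],
              ['the weather is cold', 'bring a coat']]
rng = random.Random(0)
for _ in range(4):
    print(make_nsp_pair('this movie is great', 'i like it', paragraphs, rng))
```

On such a tiny corpus the random draw can occasionally return the true next sentence with the label False; with a large corpus this kind of label noise becomes rare.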


The following NextSentencePred class uses a one-hidden-layer MLP to predict whether
the second sentence is the next sentence of the first in the BERT input sequence. Due to
self-attention in the Transformer encoder, the BERT representation of the special token
“<cls>” encodes both sentences from the input. Hence, the output layer (self.output)
of the MLP classifier takes X as input, where X is the output of the MLP hidden
layer whose input is the encoded “<cls>” token.


#@save
class NextSentencePred(nn.Module):
    """The next sentence prediction task of BERT."""
    def __init__(self, **kwargs):
        super(NextSentencePred, self).__init__(**kwargs)
        self.output = nn.LazyLinear(2)

    def forward(self, X):
        # `X` shape: (batch size, `num_hiddens`)
        return self.output(X)


We can see that the forward inference of a NextSentencePred instance returns binary
predictions for each BERT input sequence. 


# PyTorch by default will not flatten the tensor as seen in mxnet where, if
# flatten=True, all but the first axis of input data are collapsed together
encoded_X = torch.flatten(encoded_X, start_dim=1)
# input_shape for NSP: (batch size, `num_hiddens`)
nsp = NextSentencePred()
nsp_Y_hat = nsp(encoded_X)
nsp_Y_hat.shape


torch.Size([2, 2])


The cross-entropy loss of the 2 binary classifications can also be computed. 


nsp_y = torch.tensor([0, 1])
nsp_l = loss(nsp_Y_hat, nsp_y)
nsp_l.shape


torch.Size([2])


It is noteworthy that all the labels in both the aforementioned pretraining tasks can be triv- 
ially obtained from the pretraining corpus without manual labeling effort. The original 
BERT has been pretrained on the concatenation of BookCorpus (Zhu et al., 2015) and En- 
glish Wikipedia. These two text corpora are huge: they have 800 million words and 2.5 
billion words, respectively. 


15.8.6 Putting It All Together 


When pretraining BERT, the final loss function is a linear combination of both the loss
functions for masked language modeling and next sentence prediction. Now we can define
the BERTModel class by instantiating the three classes BERTEncoder, MaskLM, and
NextSentencePred. The forward inference returns the encoded BERT representations
encoded_X, predictions of masked language modeling mlm_Y_hat, and next sentence
predictions nsp_Y_hat.


#@save
class BERTModel(nn.Module):
    """The BERT model."""
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens,
                 num_heads, num_blks, dropout, max_len=1000):
        super(BERTModel, self).__init__()
        self.encoder = BERTEncoder(vocab_size, num_hiddens, ffn_num_hiddens,
                                   num_heads, num_blks, dropout,
                                   max_len=max_len)
        self.hidden = nn.Sequential(nn.LazyLinear(num_hiddens),
                                    nn.Tanh())
        self.mlm = MaskLM(vocab_size, num_hiddens)
        self.nsp = NextSentencePred()

    def forward(self, tokens, segments, valid_lens=None, pred_positions=None):
        encoded_X = self.encoder(tokens, segments, valid_lens)
        if pred_positions is not None:
            mlm_Y_hat = self.mlm(encoded_X, pred_positions)
        else:
            mlm_Y_hat = None
        # The hidden layer of the MLP classifier for next sentence prediction.
        # 0 is the index of the '<cls>' token
        nsp_Y_hat = self.nsp(self.hidden(encoded_X[:, 0, :]))
        return encoded_X, mlm_Y_hat, nsp_Y_hat


15.8.7 Summary 


• Word embedding models such as word2vec and GloVe are context-independent. They
  assign the same pretrained vector to the same word regardless of the context of the
  word (if any). It is hard for them to handle polysemy or complex semantics in
  natural languages well.

• For context-sensitive word representations such as ELMo and GPT, representations of
  words depend on their contexts.

• ELMo encodes context bidirectionally but uses task-specific architectures (however, it is
  practically non-trivial to craft a specific architecture for every natural language
  processing task), while GPT is task-agnostic but encodes context left-to-right.

• BERT combines the best of both worlds: it encodes context bidirectionally and requires
  minimal architecture changes for a wide range of natural language processing tasks.

• The embeddings of the BERT input sequence are the sum of the token embeddings,
  segment embeddings, and positional embeddings.

• Pretraining BERT is composed of two tasks: masked language modeling and next sentence
  prediction. The former is able to encode bidirectional context for representing
  words, while the latter explicitly models the logical relationship between text pairs.


15.8.8 Exercises


1. All other things being equal, will a masked language model require more or fewer pre- 
training steps to converge than a left-to-right language model? Why? 


2. In the original implementation of BERT, the positionwise feed-forward network in
BERTEncoder (via d2l.TransformerEncoderBlock) and the fully connected layer in MaskLM
both use the Gaussian error linear unit (GELU) (Hendrycks and Gimpel, 2016) as the
activation function. Research the difference between GELU and ReLU.


Discussions.


15.9 The Dataset for Pretraining BERT 


To pretrain the BERT model as implemented in Section 15.8, we need to generate the dataset
in a format that facilitates the two pretraining tasks: masked language modeling and
next sentence prediction. On the one hand, the original BERT model is pretrained on
the concatenation of two huge corpora, BookCorpus and English Wikipedia (see Section
15.8.5), making it hard to run for most readers of this book. On the other hand, the
off-the-shelf pretrained BERT model may not suit applications from specific domains like
medicine. Thus, it is becoming popular to pretrain BERT on a customized dataset. To
facilitate the demonstration of BERT pretraining, we use a smaller corpus, WikiText-2
(Merity et al., 2016).


Compared with the PTB dataset used for pretraining word2vec in Section 15.3, WikiText-2
(i) retains the original punctuation, making it suitable for next sentence prediction; (ii)
retains the original case and numbers; (iii) is more than twice as large.


import os
import random
import torch
from d2l import torch as d2l


In the WikiText-2 dataset, each line represents a paragraph where a space is inserted
between any punctuation and its preceding token. Paragraphs with at least two sentences are
retained. To split sentences, we only use the period as the delimiter for simplicity. We
leave discussions of more complex sentence splitting techniques to the exercises at the end
of this section.


#@save
d2l.DATA_HUB['wikitext-2'] = (
    'https://s3.amazonaws.com/research.metamind.io/wikitext/'
    'wikitext-2-v1.zip', '3c914d17d80b1459be871a5039ac23e752a53cbe')

#@save
def _read_wiki(data_dir):
    file_name = os.path.join(data_dir, 'wiki.train.tokens')
    with open(file_name, 'r') as f:
        lines = f.readlines()
    # Uppercase letters are converted to lowercase ones
    paragraphs = [line.strip().lower().split(' . ')
                  for line in lines if len(line.split(' . ')) >= 2]
    random.shuffle(paragraphs)
    return paragraphs
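

To make the splitting rule concrete, the following minimal sketch applies the same list comprehension to a few made-up lines written in the WikiText-2 style (the text and the helper name split_paragraphs are illustrative assumptions, not part of the corpus or of d2l). It shows that a line without the ' . ' delimiter is dropped, and that the final sentence keeps its trailing period because there is no space after it once the line is stripped.

```python
# Made-up lines in the WikiText-2 style: a space precedes each punctuation mark.
lines = [
    "The crane flew away . It did not return . \n",
    " = Heading without a sentence break = \n",
]


def split_paragraphs(lines):
    """Lowercase each line and split it into sentence strings on ' . '.

    Hypothetical helper that mirrors the comprehension in _read_wiki,
    without file reading or shuffling.
    """
    return [line.strip().lower().split(' . ')
            for line in lines if len(line.split(' . ')) >= 2]


paragraphs = split_paragraphs(lines)
print(paragraphs)
```

Since the filter is applied to the unstripped line while the split is applied to the stripped one, the two calls to split can return lists of different lengths for the same line.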


15.9.1 Defining Helper Functions for Pretraining Tasks 


In the following, we begin by implementing helper functions for the two BERT pretraining 
tasks: next sentence prediction and masked language modeling. These helper functions 
will be invoked later when transforming the raw text corpus into the dataset of the ideal 
format to pretrain BERT. 


Generating the Next Sentence Prediction Task 


Following the description in Section 15.8.5, the _get_next_sentence function generates
a training example for the binary classification task.


#@save
def _get_next_sentence(sentence, next_sentence, paragraphs):
    if random.random() < 0.5:
        is_next = True
    else:
        # `paragraphs` is a list of lists of lists
        next_sentence = random.choice(random.choice(paragraphs))
        is_next = False
    return sentence, next_sentence, is_next


The following function generates training examples for next sentence prediction from the 
input paragraph by invoking the _get_next_sentence function. Here paragraph is a list 
of sentences, where each sentence is a list of tokens. The argument max_len specifies the 
maximum length of a BERT input sequence during pretraining. 


#@save
def _get_nsp_data_from_paragraph(paragraph, paragraphs, vocab, max_len):
    nsp_data_from_paragraph = []
    for i in range(len(paragraph) - 1):
        tokens_a, tokens_b, is_next = _get_next_sentence(
            paragraph[i], paragraph[i + 1], paragraphs)
        # Consider 1 '<cls>' token and 2 '<sep>' tokens
        if len(tokens_a) + len(tokens_b) + 3 > max_len:
            continue
        tokens, segments = d2l.get_tokens_and_segments(tokens_a, tokens_b)
        nsp_data_from_paragraph.append((tokens, segments, is_next))
    return nsp_data_from_paragraph
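

As a rough illustration of the two functions above, the sketch below builds one next sentence prediction example from toy paragraphs. The helper tokens_and_segments is a stand-in written for this example, under the assumption that it behaves like d2l.get_tokens_and_segments (one "<cls>" token, one "<sep>" after each sentence, segment 0 for the first sentence and 1 for the second). The toy sentences and the seeded random generator are likewise assumptions made for illustration.

```python
import random


def tokens_and_segments(tokens_a, tokens_b=None):
    """Assumed stand-in for d2l.get_tokens_and_segments."""
    tokens = ['<cls>'] + tokens_a + ['<sep>']
    segments = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        tokens += tokens_b + ['<sep>']
        segments += [1] * (len(tokens_b) + 1)
    return tokens, segments


def make_nsp_pair(sentence, next_sentence, paragraphs, rng):
    """Keep the true next sentence half of the time, else sample a random one."""
    if rng.random() < 0.5:
        return sentence, next_sentence, True
    # `paragraphs` is a list of paragraphs, each a list of token lists
    return sentence, rng.choice(rng.choice(paragraphs)), False


paragraphs = [
    [['a', 'crane', 'is', 'flying'], ['it', 'is', 'tall']],
    [['he', 'just', 'left'], ['she', 'stayed']],
]
rng = random.Random(0)
tokens_a, tokens_b, is_next = make_nsp_pair(
    paragraphs[0][0], paragraphs[0][1], paragraphs, rng)
tokens, segments = tokens_and_segments(tokens_a, tokens_b)
print(tokens, segments, is_next)
```

The length check in _get_nsp_data_from_paragraph adds 3 to the two sentence lengths because the combined sequence always carries these three special tokens.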


Generating the Masked Language Modeling Task


In order to generate training examples for the masked language modeling task from a BERT 
input sequence, we define the following _replace_mlm_tokens function. In its inputs, to- 
kens is a list of tokens representing a BERT input sequence, candidate_pred_positions 
is a list of token indices of the BERT input sequence excluding those of special tokens (spe- 
cial tokens are not predicted in the masked language modeling task), and num_mlm_preds 
indicates the number of predictions (recall that 15% of the tokens are chosen at random for prediction). Following the
definition of the masked language modeling task in Section 15.8.5, at each prediction posi- 
tion, the input may be replaced by a special “<mask>” token or a random token, or remain 
unchanged. In the end, the function returns the input tokens after possible replacement, the 
token indices where predictions take place and labels for these predictions. 


#@save
def _replace_mlm_tokens(tokens, candidate_pred_positions, num_mlm_preds,
                        vocab):
    # For the input of a masked language model, make a new copy of tokens and
    # replace some of them by '<mask>' or random tokens
    mlm_input_tokens = [token for token in tokens]
    pred_positions_and_labels = []
    # Shuffle for getting 15% random tokens for prediction in the masked
    # language modeling task
    random.shuffle(candidate_pred_positions)
    for mlm_pred_position in candidate_pred_positions:
        if len(pred_positions_and_labels) >= num_mlm_preds:
            break
        masked_token = None
        # 80% of the time: replace the word with the '<mask>' token
        if random.random() < 0.8:
            masked_token = '<mask>'
        else:
            # 10% of the time: keep the word unchanged
            if random.random() < 0.5:
                masked_token = tokens[mlm_pred_position]
            # 10% of the time: replace the word with a random word
            else:
                masked_token = random.choice(vocab.idx_to_token)
        mlm_input_tokens[mlm_pred_position] = masked_token
        pred_positions_and_labels.append(
            (mlm_pred_position, tokens[mlm_pred_position]))
    return mlm_input_tokens, pred_positions_and_labels
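

The nested random draws above yield the 80%/10%/10% split because the second draw only happens in the remaining 20% of cases and divides them in half. The following sketch checks this empirically. The function choose_replacement, the toy vocabulary, and the sample size are assumptions made for illustration only.

```python
import random
from collections import Counter


def choose_replacement(token, vocab_tokens, rng):
    """Return (replacement, rule) using the same nested draws as above."""
    if rng.random() < 0.8:
        return '<mask>', 'mask'                    # 80%: mask the token
    if rng.random() < 0.5:
        return token, 'keep'                       # 0.2 * 0.5 = 10%: keep it
    return rng.choice(vocab_tokens), 'random'      # 0.2 * 0.5 = 10%: random token


rng = random.Random(42)
vocab_tokens = ['a', 'crane', 'is', 'flying', 'driver']  # toy vocabulary
num_samples = 100_000
counts = Counter(choose_replacement('crane', vocab_tokens, rng)[1]
                 for _ in range(num_samples))
fractions = {rule: n / num_samples for rule, n in counts.items()}
print(fractions)
```

Note that the "random" branch may by chance pick the original token, so the share of positions whose input token actually changes is slightly below 90%.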


By invoking the aforementioned _replace_mlm_tokens function, the following function 
takes a BERT input sequence (tokens) as an input and returns indices of the input tokens 
(after possible token replacement as described in Section 15.8.5), the token indices where 
predictions take place, and label indices for these predictions. 


#@save
def _get_mlm_data_from_tokens(tokens, vocab):
    candidate_pred_positions = []
    # `tokens` is a list of strings
    for i, token in enumerate(tokens):
        # Special tokens are not predicted in the masked language modeling
        # task
        if token in ['<cls>', '<sep>']:
            continue
        candidate_pred_positions.append(i)
    # 15% of random tokens are predicted in the masked language modeling task
    num_mlm_preds = max(1, round(len(tokens) * 0.15))
    mlm_input_tokens, pred_positions_and_labels = _replace_mlm_tokens(
        tokens, candidate_pred_positions, num_mlm_preds, vocab)
    pred_positions_and_labels = sorted(pred_positions_and_labels,
                                       key=lambda x: x[0])
    pred_positions = [v[0] for v in pred_positions_and_labels]
    mlm_pred_labels = [v[1] for v in pred_positions_and_labels]
    return vocab[mlm_input_tokens], pred_positions, vocab[mlm_pred_labels]


15.9.2 Transforming Text into the Pretraining Dataset 


Now we are almost ready to customize a Dataset class for pretraining BERT. Before that,
we still need to define a helper function _pad_bert_inputs to append the special “<pad>”
tokens to the inputs. Its argument examples contains the outputs from the helper
functions _get_nsp_data_from_paragraph and _get_mlm_data_from_tokens for the two
pretraining tasks.


#@save
def _pad_bert_inputs(examples, max_len, vocab):
    max_num_mlm_preds = round(max_len * 0.15)
    all_token_ids, all_segments, valid_lens, = [], [], []
    all_pred_positions, all_mlm_weights, all_mlm_labels = [], [], []
    nsp_labels = []
    for (token_ids, pred_positions, mlm_pred_label_ids, segments,
         is_next) in examples:
        all_token_ids.append(torch.tensor(token_ids + [vocab['<pad>']] * (
            max_len - len(token_ids)), dtype=torch.long))
        all_segments.append(torch.tensor(segments + [0] * (
            max_len - len(segments)), dtype=torch.long))
        # `valid_lens` excludes count of '<pad>' tokens
        valid_lens.append(torch.tensor(len(token_ids), dtype=torch.float32))
        all_pred_positions.append(torch.tensor(pred_positions + [0] * (
            max_num_mlm_preds - len(pred_positions)), dtype=torch.long))
        # Predictions of padded tokens will be filtered out in the loss via
        # multiplication of 0 weights
        all_mlm_weights.append(
            torch.tensor([1.0] * len(mlm_pred_label_ids) + [0.0] * (
                max_num_mlm_preds - len(pred_positions)),
                dtype=torch.float32))
        all_mlm_labels.append(torch.tensor(mlm_pred_label_ids + [0] * (
            max_num_mlm_preds - len(mlm_pred_label_ids)), dtype=torch.long))
        nsp_labels.append(torch.tensor(is_next, dtype=torch.long))
    return (all_token_ids, all_segments, valid_lens, all_pred_positions,
            all_mlm_weights, all_mlm_labels, nsp_labels)
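

The padding logic can be traced on a single example with plain Python lists. The sketch below is a simplified illustration, not the function above: pad_example is a hypothetical helper, the token ids are made up, and the padding index is assumed to be 0. It pads the token ids to max_len and pads the prediction positions, weights, and labels to round(max_len * 0.15) entries, giving zero weight to the padded predictions.

```python
def pad_example(token_ids, pred_positions, mlm_label_ids, max_len, pad_id=0):
    """Pad one BERT pretraining example (hypothetical list-based helper)."""
    max_num_mlm_preds = round(max_len * 0.15)
    padded_ids = token_ids + [pad_id] * (max_len - len(token_ids))
    valid_len = len(token_ids)          # excludes the padding tokens
    num_pad_preds = max_num_mlm_preds - len(pred_positions)
    padded_positions = pred_positions + [0] * num_pad_preds
    # Real predictions get weight 1, padded predictions get weight 0
    weights = [1.0] * len(mlm_label_ids) + [0.0] * num_pad_preds
    padded_labels = mlm_label_ids + [0] * num_pad_preds
    return padded_ids, valid_len, padded_positions, weights, padded_labels


# Made-up ids: five tokens, one of which (position 2) is predicted
padded_ids, valid_len, positions, weights, labels = pad_example(
    [3, 17, 9, 25, 4], [2], [9], max_len=16)
print(len(padded_ids), valid_len, positions, weights, labels)
```

With max_len = 16 there are round(2.4) = 2 prediction slots, and with max_len = 64 there are round(9.6) = 10, which is the second dimension of the prediction tensors in a minibatch of that length.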


Putting the helper functions for generating training examples of the two pretraining tasks,
and the helper function for padding inputs together, we customize the following
_WikiTextDataset class as the WikiText-2 dataset for pretraining BERT. By implementing the
__getitem__ function, we can arbitrarily access the pretraining (masked language modeling
and next sentence prediction) examples generated from a pair of sentences from the
WikiText-2 corpus.


The original BERT model uses WordPiece embeddings whose vocabulary size is 30000
(Wu et al., 2016). The tokenization method of WordPiece is a slight modification of the
original byte pair encoding algorithm in Section 15.6.2. For simplicity, we use the
d2l.tokenize function for tokenization. Infrequent tokens that appear fewer than five times
are filtered out.


#@save
class _WikiTextDataset(torch.utils.data.Dataset):
    def __init__(self, paragraphs, max_len):
        # Input `paragraphs[i]` is a list of sentence strings representing a
        # paragraph; while output `paragraphs[i]` is a list of sentences
        # representing a paragraph, where each sentence is a list of tokens
        paragraphs = [d2l.tokenize(
            paragraph, token='word') for paragraph in paragraphs]
        sentences = [sentence for paragraph in paragraphs
                     for sentence in paragraph]
        self.vocab = d2l.Vocab(sentences, min_freq=5, reserved_tokens=[
            '<pad>', '<mask>', '<cls>', '<sep>'])
        # Get data for the next sentence prediction task
        examples = []
        for paragraph in paragraphs:
            examples.extend(_get_nsp_data_from_paragraph(
                paragraph, paragraphs, self.vocab, max_len))
        # Get data for the masked language model task
        examples = [(_get_mlm_data_from_tokens(tokens, self.vocab)
                     + (segments, is_next))
                    for tokens, segments, is_next in examples]
        # Pad inputs
        (self.all_token_ids, self.all_segments, self.valid_lens,
         self.all_pred_positions, self.all_mlm_weights,
         self.all_mlm_labels, self.nsp_labels) = _pad_bert_inputs(
            examples, max_len, self.vocab)

    def __getitem__(self, idx):
        return (self.all_token_ids[idx], self.all_segments[idx],
                self.valid_lens[idx], self.all_pred_positions[idx],
                self.all_mlm_weights[idx], self.all_mlm_labels[idx],
                self.nsp_labels[idx])

    def __len__(self):
        return len(self.all_token_ids)


By using the _read_wiki function and the _WikiTextDataset class, we define the following
load_data_wiki function to download the WikiText-2 dataset and generate pretraining
examples from it.


#@save
def load_data_wiki(batch_size, max_len):
    """Load the WikiText-2 dataset."""
    num_workers = d2l.get_dataloader_workers()
    data_dir = d2l.download_extract('wikitext-2', 'wikitext-2')
    paragraphs = _read_wiki(data_dir)
    train_set = _WikiTextDataset(paragraphs, max_len)
    train_iter = torch.utils.data.DataLoader(train_set, batch_size,
                                        shuffle=True, num_workers=num_workers)
    return train_iter, train_set.vocab


Setting the batch size to 512 and the maximum length of a BERT input sequence to
64, we print out the shapes of a minibatch of BERT pretraining examples. Note that in
each BERT input sequence, 10 positions (64 × 0.15 = 9.6, rounded to the nearest integer)
are predicted for the masked language modeling task.


batch_size, max_len = 512, 64
train_iter, vocab = load_data_wiki(batch_size, max_len)

for (tokens_X, segments_X, valid_lens_x, pred_positions_X, mlm_weights_X,
     mlm_Y, nsp_y) in train_iter:
    print(tokens_X.shape, segments_X.shape, valid_lens_x.shape,
          pred_positions_X.shape, mlm_weights_X.shape, mlm_Y.shape,
          nsp_y.shape)
    break


Downloading ../data/wikitext-2-v1.zip from https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip...
torch.Size([512, 64]) torch.Size([512, 64]) torch.Size([512]) torch.Size([512, 10]) torch.Size([512, 10]) torch.Size([512, 10]) torch.Size([512])


Finally, let’s take a look at the vocabulary size. Even after filtering out infrequent tokens,
it is still more than twice as large as that of the PTB dataset.


len(vocab) 


20256 


15.9.3 Summary 


• Compared with the PTB dataset, the WikiText-2 dataset retains the original punctuation,
  case, and numbers, and is more than twice as large.


• We can arbitrarily access the pretraining (masked language modeling and next sentence
  prediction) examples generated from a pair of sentences from the WikiText-2 corpus.


15.9.4 Exercises 


1. For simplicity, the period is used as the only delimiter for splitting sentences. Try other
sentence splitting techniques, such as those in spaCy and NLTK. Take NLTK as an example.
You need to install NLTK first: pip install nltk. In the code, first import
nltk. Then, download the Punkt sentence tokenizer: nltk.download('punkt'). To
split sentences such as sentences = 'This is great ! Why not ?', invoking
nltk.tokenize.sent_tokenize(sentences) will return a list of two sentence
strings: ['This is great !', 'Why not ?'].


2. What is the vocabulary size if we do not filter out any infrequent token? 


Discussions.


15.10 Pretraining BERT 


With the BERT model implemented in Section 15.8 and the pretraining examples generated 
from the WikiText-2 dataset in Section 15.9, we will pretrain BERT on the WikiText-2 
dataset in this section. 


import torch
from torch import nn
from d2l import torch as d2l


To start, we load the WikiText-2 dataset as minibatches of pretraining examples for masked 
language modeling and next sentence prediction. The batch size is 512 and the maximum 
length of a BERT input sequence is 64. Note that in the original BERT model, the maximum 
length is 512. 


batch_size, max_len = 512, 64
train_iter, vocab = d2l.load_data_wiki(batch_size, max_len)


15.10.1 Pretraining BERT 


The original BERT has two versions of different model sizes (Devlin et al., 2018). The base
model (BERT_BASE) uses 12 layers (Transformer encoder blocks) with 768 hidden units
(hidden size) and 12 self-attention heads. The large model (BERT_LARGE) uses 24 layers
with 1024 hidden units and 16 self-attention heads. Notably, the former has 110 million
parameters while the latter has 340 million parameters. For ease of demonstration, we
define a small BERT, using 2 layers, 128 hidden units, and 2 self-attention heads.


net = d2l.BERTModel(len(vocab), num_hiddens=128,
                    ffn_num_hiddens=256, num_heads=2, num_blks=2, dropout=0.2)
devices = d2l.try_all_gpus()
loss = nn.CrossEntropyLoss()
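

The parameter counts quoted for the two original models can be reproduced approximately with a back-of-the-envelope calculation. The sketch below is only an estimate under several assumptions that are not stated in this section: a WordPiece vocabulary of 30522 tokens (the size of the released uncased vocabulary), 512 positions, 2 segments, a feed-forward hidden size of four times the model width, layer normalization after the embeddings and after each sublayer, and one extra fully connected layer on the “<cls>” representation. The pretraining output heads are not counted. The function name bert_param_count is hypothetical.

```python
def bert_param_count(vocab_size, hidden, num_layers, ffn_hidden,
                     max_positions=512, num_segments=2):
    """Estimate the number of parameters of a BERT encoder (assumed layout)."""
    layer_norm = 2 * hidden  # scale and shift
    embeddings = ((vocab_size + max_positions + num_segments) * hidden
                  + layer_norm)
    # Query, key, value, and output projections, each with a bias
    attention = 4 * (hidden * hidden + hidden)
    ffn = (hidden * ffn_hidden + ffn_hidden) + (ffn_hidden * hidden + hidden)
    block = attention + layer_norm + ffn + layer_norm
    cls_layer = hidden * hidden + hidden
    return embeddings + num_layers * block + cls_layer


base = bert_param_count(30522, 768, num_layers=12, ffn_hidden=3072)
large = bert_param_count(30522, 1024, num_layers=24, ffn_hidden=4096)
print(f'base: {base / 1e6:.1f}M, large: {large / 1e6:.1f}M')
```

Under these assumptions the estimates land close to the 110 million and 340 million figures. The number of attention heads does not enter the count, since the heads only partition the projection matrices.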


Before defining the training loop, we define a helper function _get_batch_loss_bert.
Given the shard of training examples, this function computes the loss for both the masked
language modeling and next sentence prediction tasks. Note that the final loss of BERT
pretraining is just the sum of both the masked language modeling loss and the next sentence
prediction loss.


#@save
def _get_batch_loss_bert(net, loss, vocab_size, tokens_X,
                         segments_X, valid_lens_x,
                         pred_positions_X, mlm_weights_X,
                         mlm_Y, nsp_y):
    # Forward pass
    _, mlm_Y_hat, nsp_Y_hat = net(tokens_X, segments_X,
                                  valid_lens_x.reshape(-1),
                                  pred_positions_X)
    # Compute masked language model loss
    mlm_l = loss(mlm_Y_hat.reshape(-1, vocab_size), mlm_Y.reshape(-1)) *\
    mlm_weights_X.reshape(-1, 1)
    mlm_l = mlm_l.sum() / (mlm_weights_X.sum() + 1e-8)
    # Compute next sentence prediction loss
    nsp_l = loss(nsp_Y_hat, nsp_y)
    l = mlm_l + nsp_l
    return mlm_l, nsp_l, l
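

The role of the weights is easiest to see on a handful of numbers. The following sketch is a plain Python illustration that assumes one loss value per prediction position (that is, an unreduced loss). The helper masked_mean and all loss values are made up. Positions with weight 0 correspond to padded predictions and do not contribute to the average.

```python
def masked_mean(per_position_losses, weights, eps=1e-8):
    """Average the losses over positions whose weight is nonzero."""
    total = sum(l * w for l, w in zip(per_position_losses, weights))
    return total / (sum(weights) + eps)


# Two real predictions followed by two padded ones (made-up loss values)
per_position_losses = [2.0, 4.0, 9.0, 9.0]
weights = [1.0, 1.0, 0.0, 0.0]

mlm_loss = masked_mean(per_position_losses, weights)
nsp_loss = 0.7                       # made-up next sentence prediction loss
total_loss = mlm_loss + nsp_loss     # the final loss is the plain sum
print(mlm_loss, total_loss)
```

The small constant eps only guards against division by zero when a batch happens to contain no real predictions.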


Invoking the two aforementioned helper functions, the following train_bert function
defines the procedure to pretrain BERT (net) on the WikiText-2 (train_iter) dataset.
Training BERT can take a very long time. Instead of specifying the number of epochs for
training as in the train_ch13 function (see Section 14.1), the input num_steps of the
following function specifies the number of iteration steps for training.


def train_bert(train_iter, net, loss, vocab_size, devices, num_steps):
    net(*next(iter(train_iter))[:4])
    net = nn.DataParallel(net, device_ids=devices).to(devices[0])
    trainer = torch.optim.Adam(net.parameters(), lr=0.01)
    step, timer = 0, d2l.Timer()
    animator = d2l.Animator(xlabel='step', ylabel='loss',
                            xlim=[1, num_steps], legend=['mlm', 'nsp'])
    # Sum of masked language modeling losses, sum of next sentence prediction
    # losses, no. of sentence pairs, count
    metric = d2l.Accumulator(4)
    num_steps_reached = False
    while step < num_steps and not num_steps_reached:
        for tokens_X, segments_X, valid_lens_x, pred_positions_X,\
            mlm_weights_X, mlm_Y, nsp_y in train_iter:
            tokens_X = tokens_X.to(devices[0])
            segments_X = segments_X.to(devices[0])
            valid_lens_x = valid_lens_x.to(devices[0])
            pred_positions_X = pred_positions_X.to(devices[0])
            mlm_weights_X = mlm_weights_X.to(devices[0])
            mlm_Y, nsp_y = mlm_Y.to(devices[0]), nsp_y.to(devices[0])
            trainer.zero_grad()
            timer.start()
            mlm_l, nsp_l, l = _get_batch_loss_bert(
                net, loss, vocab_size, tokens_X, segments_X, valid_lens_x,
                pred_positions_X, mlm_weights_X, mlm_Y, nsp_y)
            l.backward()
            trainer.step()
            metric.add(mlm_l, nsp_l, tokens_X.shape[0], 1)
            timer.stop()
            animator.add(step + 1,
                         (metric[0] / metric[3], metric[1] / metric[3]))
            step += 1
            if step == num_steps:
                num_steps_reached = True
                break

    print(f'MLM loss {metric[0] / metric[3]:.3f}, '
          f'NSP loss {metric[1] / metric[3]:.3f}')
    print(f'{metric[2] / timer.sum():.1f} sentence pairs/sec on '
          f'{str(devices)}')
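

The control flow of a step-based loop can be isolated from the model code. The sketch below is a simplified illustration with hypothetical names (run_steps, train_step) and a plain list standing in for the data iterator. It restarts the iterator as many times as needed and stops in the middle of a pass once the requested number of steps has been taken.

```python
def run_steps(data_iter, num_steps, train_step):
    """Call train_step on successive batches until num_steps calls are made.

    Assumes data_iter is non-empty and can be iterated repeatedly.
    """
    step = 0
    while step < num_steps:
        for batch in data_iter:
            train_step(batch)
            step += 1
            if step == num_steps:
                return step
    return step


seen = []
steps_done = run_steps(['b0', 'b1', 'b2'], num_steps=7, train_step=seen.append)
print(steps_done, seen)
```

Counting steps rather than epochs makes the training budget independent of the dataset size, which is convenient when a single pass over the corpus is already expensive.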


We can plot both the masked language modeling loss and the next sentence prediction loss 
during BERT pretraining. 


train_bert(train_iter, net, loss, len(vocab), devices, 50) 


MLM loss 5.885, NSP loss 0.760
4413.2 sentence pairs/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]


[Plot: mlm and nsp loss versus step]


15.10.2 Representing Text with BERT 


After pretraining BERT, we can use it to represent single text, text pairs, or any token in 
them. The following function returns the BERT (net) representations for all tokens in 
tokens_a and tokens_b. 


def get_bert_encoding(net, tokens_a, tokens_b=None):
    tokens, segments = d2l.get_tokens_and_segments(tokens_a, tokens_b)
    token_ids = torch.tensor(vocab[tokens], device=devices[0]).unsqueeze(0)
    segments = torch.tensor(segments, device=devices[0]).unsqueeze(0)
    valid_len = torch.tensor(len(tokens), device=devices[0]).unsqueeze(0)
    encoded_X, _, _ = net(token_ids, segments, valid_len)
    return encoded_X


Consider the sentence “a crane is flying”. Recall the input representation of BERT as
discussed in Section 15.8.4. After inserting special tokens “<cls>” (used for classification)
and “<sep>” (used for separation), the BERT input sequence has a length of six. Since
zero is the index of the “<cls>” token, encoded_text[:, 0, :] is the BERT representation
of the entire input sentence. To evaluate the polysemous token “crane”, we also print
out the first three elements of the BERT representation of the token.


tokens_a = ['a', 'crane', 'is', 'flying']
encoded_text = get_bert_encoding(net, tokens_a)
# Tokens: '<cls>', 'a', 'crane', 'is', 'flying', '<sep>'
encoded_text_cls = encoded_text[:, 0, :]
encoded_text_crane = encoded_text[:, 2, :]
encoded_text.shape, encoded_text_cls.shape, encoded_text_crane[0][:3]


(torch.Size([1, 6, 128]),
 torch.Size([1, 128]),
 tensor([0.8414, 1.4830, 0.8226], device='cuda:0', grad_fn=<SliceBackward0>))


Now consider a sentence pair “a crane driver came” and “he just left”. Similarly,
encoded_pair[:, 0, :] is the encoded result of the entire sentence pair from the pretrained
BERT. Note that the first three elements of the polysemous token “crane” are different from
those obtained when the context is different. This supports the claim that BERT
representations are context-sensitive.


tokens_a, tokens_b = ['a', 'crane', 'driver', 'came'], ['he', 'just', 'left']
encoded_pair = get_bert_encoding(net, tokens_a, tokens_b)
# Tokens: '<cls>', 'a', 'crane', 'driver', 'came', '<sep>', 'he', 'just',
# 'left', '<sep>'
encoded_pair_cls = encoded_pair[:, 0, :]
encoded_pair_crane = encoded_pair[:, 2, :]
encoded_pair.shape, encoded_pair_cls.shape, encoded_pair_crane[0][:3]


(torch.Size([1, 10, 128]),
 torch.Size([1, 128]),
 tensor([0.0430, 1.6132, 0.0437], device='cuda:0', grad_fn=<SliceBackward0>))


In Chapter 16, we will fine-tune a pretrained BERT model for downstream natural language 
processing applications. 


15.10.3 Summary 


• The original BERT has two versions, where the base model has 110 million parameters
  and the large model has 340 million parameters.


• After pretraining BERT, we can use it to represent single text, text pairs, or any token in
  them.


• In the experiment, the same token has different BERT representations when its contexts
  are different. This supports the claim that BERT representations are context-sensitive.


15.10.4 Exercises 


1. In the experiment, we can see that the masked language modeling loss is significantly
higher than the next sentence prediction loss. Why?


2. Set the maximum length of a BERT input sequence to be 512 (same as the original BERT
model). Use the configurations of the original BERT model such as BERT_LARGE. Do
you encounter any error when running this section? Why?


Discussions.


16 Natural Language Processing: Applications


We have seen how to represent tokens in text sequences and train their representations in 
Chapter 15. Such pretrained text representations can be fed to various models for different 
downstream natural language processing tasks. 


In fact, earlier chapters have already discussed some natural language processing applica- 
tions without pretraining, just for explaining deep learning architectures. For instance, in 
Chapter 9, we have relied on RNNs to design language models to generate novella-like text. 
In Chapter 10 and Chapter 11, we have also designed models based on RNNs and attention 
mechanisms for machine translation. 


However, this book does not intend to cover all such applications in a comprehensive
manner. Instead, our focus is on how to apply (deep) representation learning of languages
to address natural language processing problems. Given pretrained text representations,
this chapter will explore two popular and representative downstream natural language pro- 
cessing tasks: sentiment analysis and natural language inference, which analyze single text 
and relationships of text pairs, respectively. 


[Figure: three stacked rows. Application: sentiment analysis, natural language inference.
Architecture: MLP, CNN, RNN, attention. Pretraining: word2vec, GloVe, subword embedding,
BERT.]

Fig. 16.1: Pretrained text representations can be fed to various deep learning architectures
for different downstream natural language processing applications. This chapter focuses on
how to design models for different downstream natural language processing applications.


As depicted in Fig. 16.1, this chapter focuses on describing the basic ideas of designing
natural language processing models using different types of deep learning architectures, such
as MLPs, CNNs, RNNs, and attention. Though it is possible to combine any pretrained
text representations with any architecture for either application in Fig. 16.1, we select a few
representative combinations. Specifically, we will explore popular architectures based on
RNNs and CNNs for sentiment analysis. For natural language inference, we choose
attention and MLPs to demonstrate how to analyze text pairs. In the end, we introduce how to
fine-tune a pretrained BERT model for a wide range of natural language processing
applications, such as on a sequence level (single text classification and text pair classification)
and a token level (text tagging and question answering). As a concrete empirical case, we
will fine-tune BERT for natural language inference.


As we have introduced in Section 15.8, BERT requires minimal architecture changes for a 
wide range of natural language processing applications. However, this benefit comes at the 
cost of fine-tuning a huge number of BERT parameters for the downstream applications. 
When space or time is limited, those crafted models based on MLPs, CNNs, RNNs, and 
attention are more feasible. In the following, we start with the sentiment analysis application
and illustrate the model design based on RNNs and CNNs, respectively.


16.1 Sentiment Analysis and the Dataset 


With the proliferation of online social media and review platforms, a plethora of opinion- 
ated data has been logged, bearing great potential for supporting decision making processes. 
Sentiment analysis studies people’s sentiments in their produced text, such as product re- 
views, blog comments, and forum discussions. It enjoys wide applications to fields as 
diverse as politics (e.g., analysis of public sentiments towards policies), finance (e.g., anal- 
ysis of sentiments of the market), and marketing (e.g., product research and brand manage- 
ment). 


Since sentiments can be categorized as discrete polarities or scales (e.g., positive and
negative), we can consider sentiment analysis as a text classification task, which transforms a
varying-length text sequence into a fixed-length text category. In this chapter, we will use
Stanford’s large movie review dataset for sentiment analysis. It consists of a training set
and a testing set, each containing 25000 movie reviews downloaded from IMDb. In both
datasets, there are equal numbers of “positive” and “negative” labels, indicating different
sentiment polarities.


import os
import torch
from torch import nn
from d2l import torch as d2l


16.1.1 Reading the Dataset 


First, download and extract this IMDb review dataset in the path ../data/aclImdb.


#@save
d2l.DATA_HUB['aclImdb'] = (d2l.DATA_URL + 'aclImdb_v1.tar.gz',
                          '01ada507287d82875905620988597833ad4e0903')

data_dir = d2l.download_extract('aclImdb', 'aclImdb')


Downloading ../data/aclImdb_v1.tar.gz from http://d2l-data.s3-accelerate.amazonaws.com/aclImdb_v1.tar.gz...


Next, read the training and test datasets. Each example is a review and its label: 1 for 
“positive” and 0 for “negative”. 


#@save
def read_imdb(data_dir, is_train):
    """Read the IMDb review dataset text sequences and labels."""
    data, labels = [], []
    for label in ('pos', 'neg'):
        folder_name = os.path.join(data_dir, 'train' if is_train else 'test',
                                   label)
        for file in os.listdir(folder_name):
            with open(os.path.join(folder_name, file), 'rb') as f:
                review = f.read().decode('utf-8').replace('\n', '')
                data.append(review)
                labels.append(1 if label == 'pos' else 0)
    return data, labels

train_data = read_imdb(data_dir, is_train=True)
print('# trainings:', len(train_data[0]))
for x, y in zip(train_data[0][:3], train_data[1][:3]):
    print('label:', y, 'review:', x[:60])


# trainings: 25000
label: 1 review: Zentropa has much in common with The Third Man, another noir
label: 1 review: Zentropa is the most original movie I've seen in years. If y
label: 1 review: Lars Von Trier is never backward in trying out new technique
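

The reader above relies on a directory layout in which each split has a pos and a neg subfolder with one review per file. The sketch below reproduces that layout in a temporary directory with two made-up reviews and reads it back. The function read_reviews, the file name, and the review texts are assumptions for illustration. Unlike read_imdb, it sorts the file names so that the order is reproducible.

```python
import os
import tempfile


def read_reviews(data_dir, split):
    """Read reviews and 0/1 labels from <data_dir>/<split>/{pos,neg}/."""
    data, labels = [], []
    for label in ('pos', 'neg'):
        folder = os.path.join(data_dir, split, label)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), 'rb') as f:
                data.append(f.read().decode('utf-8').replace('\n', ''))
            labels.append(1 if label == 'pos' else 0)
    return data, labels


with tempfile.TemporaryDirectory() as root:
    # Build a tiny stand-in for the extracted dataset
    for label, text in (('pos', 'Great film.\n'), ('neg', 'Dull plot.\n')):
        folder = os.path.join(root, 'train', label)
        os.makedirs(folder)
        with open(os.path.join(folder, '0_1.txt'), 'wb') as f:
            f.write(text.encode('utf-8'))
    data, labels = read_reviews(root, 'train')

print(data, labels)
```

Because all positive reviews are read before all negative ones, the examples come out grouped by label, which is why shuffling in the training data iterator matters.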


16.1.2 Preprocessing the Dataset 


Treating each word as a token and filtering out words that appear fewer than 5 times, we
create a vocabulary out of the training dataset.


train_tokens = d2l.tokenize(train_data[0], token='word')
vocab = d2l.Vocab(train_tokens, min_freq=5, reserved_tokens=['<pad>'])


After tokenization, let’s plot the histogram of review lengths in tokens. 


d2l.set_figsize()
d2l.plt.xlabel('# tokens per review')
d2l.plt.ylabel('count')
d2l.plt.hist([len(line) for line in train_tokens], bins=range(0, 1000, 50));


[Histogram: count versus # tokens per review, x-axis from 0 to 1000]


As expected, the reviews have varying lengths. To process a minibatch of such reviews
at a time, we set the length of each review to 500 with truncation and padding, which is
similar to the preprocessing step for the machine translation dataset in Section 10.5.


num_steps = 500  # sequence length
train_features = torch.tensor([d2l.truncate_pad(
    vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
print(train_features.shape)


torch.Size([25000, 500])
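

Truncation and padding can be sketched in a few lines of plain Python. The function below is written for this illustration under the assumption that it behaves like d2l.truncate_pad (cut sequences that are too long, append the padding index to those that are too short). The token indices and the padding index 0 are made up.

```python
def truncate_pad(line, num_steps, padding_token):
    """Return a list of exactly num_steps items (assumed behavior)."""
    if len(line) > num_steps:
        return line[:num_steps]                    # truncate
    return line + [padding_token] * (num_steps - len(line))  # pad


short_review = [5, 8, 2]
long_review = [7, 1, 9, 4, 3, 6, 2]
print(truncate_pad(short_review, 5, padding_token=0))
print(truncate_pad(long_review, 5, padding_token=0))
```

Since every review ends up with the same length, the list of reviews can be converted into a single two-dimensional tensor.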


16.1.3 Creating Data Iterators 


Now we can create data iterators. At each iteration, a minibatch of examples is returned.


train_iter = d21.load_array((train_features, torch.tensor(train_dataLl1])), 64) 


for X, y in train_iter: 
print(’X:', X.shape, 
break 

print(’# batches:'’, len(train_iter)) 


, y:', y.shape) 


X: torch.Size([64, 500]) , y: torch.Size([64]) 
# batches: 391 
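
The number of batches follows from simple arithmetic. The sketch below assumes that the iterator keeps the final, smaller minibatch rather than dropping it.

```python
import math

num_examples, batch_size = 25000, 64
# Assumption: the last partial minibatch is kept, so we round up
num_batches = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - (num_batches - 1) * batch_size
print(num_batches, last_batch_size)
```

Under that assumption, 390 full minibatches of 64 reviews are followed by one minibatch holding the remaining 40 reviews.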


16.1.4 Putting It All Together 


Last, we wrap up the above steps into the load_data_imdb function. It returns training 
and test data iterators and the vocabulary of the IMDb review dataset. 


#@save
def load_data_imdb(batch_size, num_steps=500):
    """Return data iterators and the vocabulary of the IMDb review dataset."""
    data_dir = d2l.download_extract('aclImdb', 'aclImdb')
    train_data = read_imdb(data_dir, True)
    test_data = read_imdb(data_dir, False)
    train_tokens = d2l.tokenize(train_data[0], token='word')
    test_tokens = d2l.tokenize(test_data[0], token='word')
    vocab = d2l.Vocab(train_tokens, min_freq=5)
    train_features = torch.tensor([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
    test_features = torch.tensor([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in test_tokens])
    train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])),
                                batch_size)
    test_iter = d2l.load_array((test_features, torch.tensor(test_data[1])),
                               batch_size,
                               is_train=False)
    return train_iter, test_iter, vocab


16.1.5 Summary 


• Sentiment analysis studies people’s sentiments in the text they produce. It is considered
  a text classification problem that transforms a varying-length text sequence into a
  fixed-length text category.


• After preprocessing, we can load Stanford’s large movie review dataset (IMDb review
  dataset) into data iterators with a vocabulary.


16.1.6 Exercises 


1. What hyperparameters in this section can we modify to accelerate training sentiment
analysis models?


2. Can you implement a function to load the dataset of Amazon reviews into data
iterators and labels for sentiment analysis?


Discussions.


16.2 Sentiment Analysis: Using Recurrent Neural 
Networks 


Like word similarity and analogy tasks, we can also apply pretrained word vectors to
sentiment analysis. Since the IMDb review dataset in Section 16.1 is not very big, using text
representations that were pretrained on large-scale corpora may reduce overfitting of the
model. As a specific example illustrated in Fig. 16.2.1, we will represent each token using
the pretrained GloVe model, and feed these token representations into a multilayer
bidirectional RNN to obtain the text sequence representation, which will be transformed into
sentiment analysis outputs (Maas et al., 2011). For the same downstream application, we
will consider a different architectural choice later.

Fig. 16.2.1 This section feeds pretrained GloVe to an RNN-based architecture for sentiment
analysis.


import torch
from torch import nn
from d2l import torch as d2l

batch_size = 64
train_iter, test_iter, vocab = d2l.load_data_imdb(batch_size)


16.2.1 Representing Single Text with RNNs 


In text classification tasks, such as sentiment analysis, a varying-length text sequence will
be transformed into fixed-length categories. In the following BiRNN class, while each token
of a text sequence gets its individual pretrained GloVe representation via the embedding
layer (self.embedding), the entire sequence is encoded by a bidirectional RNN
(self.encoder). More concretely, the hidden states (at the last layer) of the bidirectional LSTM
at both the initial and final time steps are concatenated as the representation of the text
sequence. This single text representation is then transformed into output categories by a
fully connected layer (self.decoder) with two outputs (“positive” and “negative”).


class BiRNN(nn.Module):
    def __init__(self, vocab_size, embed_size, num_hiddens,
                 num_layers, **kwargs):
        super(BiRNN, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # Set `bidirectional` to True to get a bidirectional RNN
        self.encoder = nn.LSTM(embed_size, num_hiddens, num_layers=num_layers,
                               bidirectional=True)
        self.decoder = nn.Linear(4 * num_hiddens, 2)

    def forward(self, inputs):
        # The shape of `inputs` is (batch size, no. of time steps). Because
        # LSTM requires its input's first dimension to be the temporal
        # dimension, the input is transposed before obtaining token
        # representations. The output shape is (no. of time steps, batch size,
        # word vector dimension)
        embeddings = self.embedding(inputs.T)
        self.encoder.flatten_parameters()
        # Returns hidden states of the last hidden layer at different time
        # steps. The shape of `outputs` is (no. of time steps, batch size,
        # 2 * no. of hidden units)
        outputs, _ = self.encoder(embeddings)
        # Concatenate the hidden states at the initial and final time steps as
        # the input of the fully connected layer. Its shape is (batch size,
        # 4 * no. of hidden units)
        encoding = torch.cat((outputs[0], outputs[-1]), dim=1)
        outs = self.decoder(encoding)
        return outs


Let’s construct a bidirectional RNN with two hidden layers to represent single text for
sentiment analysis.


embed_size, num_hiddens, num_layers, devices = 100, 100, 2, d2l.try_all_gpus()
net = BiRNN(len(vocab), embed_size, num_hiddens, num_layers)

def init_weights(module):
    if type(module) == nn.Linear:
        nn.init.xavier_uniform_(module.weight)
    if type(module) == nn.LSTM:
        for param in module._flat_weights_names:
            if "weight" in param:
                nn.init.xavier_uniform_(module._parameters[param])
net.apply(init_weights);


16.2.2 Loading Pretrained Word Vectors 


Below we load the pretrained 100-dimensional (needs to be consistent with embed_size) 
GloVe embeddings for tokens in the vocabulary. 


glove_embedding = d2l.TokenEmbedding('glove.6b.100d')


Print the shape of the vectors for all the tokens in the vocabulary. 


embeds = glove_embedding[vocab.idx_to_token]
embeds.shape


torch.Size([49346, 100])


We use these pretrained word vectors to represent tokens in the reviews and will not update 
these vectors during training. 


net.embedding.weight.data.copy_(embeds) 
net.embedding.weight.requires_grad = False 


16.2.3 Training and Evaluating the Model


Now we can train the bidirectional RNN for sentiment analysis.


lr, num_epochs = 0.01, 5
trainer = torch.optim.Adam(net.parameters(), lr=lr)
loss = nn.CrossEntropyLoss(reduction="none")
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)


loss 0.277, train acc 0.884, test acc 0.861
2608.4 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]


We define the following function to predict the sentiment of a text sequence using the trained
model net.


#@save
def predict_sentiment(net, vocab, sequence):
    """Predict the sentiment of a text sequence."""
    sequence = torch.tensor(vocab[sequence.split()], device=d2l.try_gpu())
    label = torch.argmax(net(sequence.reshape(1, -1)), dim=1)
    return 'positive' if label == 1 else 'negative'


Finally, let’s use the trained model to predict the sentiment for two simple sentences. 


predict_sentiment(net, vocab, 'this movie is so great')


'positive'


predict_sentiment(net, vocab, 'this movie is so bad')


'negative'


16.2.4 Summary 


• Pretrained word vectors can represent individual tokens in a text sequence.


• Bidirectional RNNs can represent a text sequence, such as via the concatenation of its
  hidden states at the initial and final time steps. This single text representation can be
  transformed into categories using a fully connected layer.


16.2.5 Exercises 


1. Increase the number of epochs. Can you improve the training and testing accuracies? 
How about tuning other hyperparameters? 


2. Use larger pretrained word vectors, such as 300-dimensional GloVe embeddings. Does 
it improve classification accuracy? 


3. Can we improve the classification accuracy by using the spaCy tokenization? You need
to install spaCy (pip install spacy) and install the English package (python -m
spacy download en). In the code, first, import spaCy (import spacy). Then, load
the spaCy English package (spacy_en = spacy.load('en')). Finally, define the
function def tokenizer(text): return [tok.text for tok in spacy_en.tokenizer(text)]
and replace the original tokenizer function. Note the different
forms of phrase tokens in GloVe and spaCy. For example, the phrase token “new york”
takes the form of “new-york” in GloVe and the form of “new york” after the spaCy
tokenization.


Discussions.


16.3 Sentiment Analysis: Using Convolutional 
Neural Networks 


In Chapter 7, we investigated mechanisms for processing two-dimensional image data with 
two-dimensional CNNs, which were applied to local features such as adjacent pixels. Though 
originally designed for computer vision, CNNs are also widely used for natural language 
processing. Simply put, just think of any text sequence as a one-dimensional image. In this 
way, one-dimensional CNNs can process local features such as n-grams in text. 


In this section, we will use the textCNN model to demonstrate how to design a CNN
architecture for representing single text (Kim, 2014). Compared with Fig. 16.2.1 that uses
an RNN architecture with GloVe pretraining for sentiment analysis, the only difference in
Fig. 16.3.1 lies in the choice of the architecture.

Fig. 16.3.1 This section feeds pretrained GloVe to a CNN-based architecture for sentiment
analysis.


import torch
from torch import nn
from d2l import torch as d2l

batch_size = 64
train_iter, test_iter, vocab = d2l.load_data_imdb(batch_size)


16.3.1 One-Dimensional Convolutions 


Before introducing the model, let’s see how a one-dimensional convolution works. Bear
in mind that it is just a special case of a two-dimensional convolution based on the
cross-correlation operation.


Fig. 16.3.2 One-dimensional cross-correlation operation. Input: 0, 1, 2, 3, 4, 5, 6; kernel:
1, 2; output: 2, 5, 8, 11, 14, 17. The shaded portions are the first output element as well as
the input and kernel tensor elements used for the output computation: 0 × 1 + 1 × 2 = 2.


As shown in Fig. 16.3.2, in the one-dimensional case, the convolution window slides from
left to right across the input tensor. During sliding, the input subtensor (e.g., 0 and 1 in
Fig. 16.3.2) contained in the convolution window at a certain position and the kernel tensor
(e.g., 1 and 2 in Fig. 16.3.2) are multiplied elementwise. The sum of these multiplications
gives the single scalar value (e.g., 0 × 1 + 1 × 2 = 2 in Fig. 16.3.2) at the corresponding
position of the output tensor.


We implement one-dimensional cross-correlation in the following corr1d function. Given 
an input tensor X and a kernel tensor K, it returns the output tensor Y. 


def corr1d(X, K):
    w = K.shape[0]
    Y = torch.zeros((X.shape[0] - w + 1))
    for i in range(Y.shape[0]):
        Y[i] = (X[i: i + w] * K).sum()
    return Y


We can construct the input tensor X and the kernel tensor K from Fig. 16.3.2 to validate the 
output of the above one-dimensional cross-correlation implementation. 


X, K = torch.tensor([0, 1, 2, 3, 4, 5, 6]), torch.tensor([1, 2])
corr1d(X, K)


tensor([ 2.,  5.,  8., 11., 14., 17.])


For any one-dimensional input with multiple channels, the convolution kernel needs to have
the same number of input channels. Then for each channel, perform a cross-correlation
operation on the one-dimensional tensor of the input and the one-dimensional tensor of
the convolution kernel, summing the results over all the channels to produce the
one-dimensional output tensor. Fig. 16.3.3 shows a one-dimensional cross-correlation
operation with 3 input channels.


Fig. 16.3.3 One-dimensional cross-correlation operation with 3 input channels. Output: 2, 8,
14, 20, 26, 32. The shaded portions are the first output element as well as the input and
kernel tensor elements used for the output computation:
0 × 1 + 1 × 2 + 1 × 3 + 2 × 4 + 2 × (−1) + 3 × (−3) = 2.


We can implement the one-dimensional cross-correlation operation for multiple input
channels and validate the results in Fig. 16.3.3.


def corr1d_multi_in(X, K):
    # First, iterate through the 0th dimension (channel dimension) of `X` and
    # `K`. Then, add them together
    return sum(corr1d(x, k) for x, k in zip(X, K))

X = torch.tensor([[0, 1, 2, 3, 4, 5, 6],
                  [1, 2, 3, 4, 5, 6, 7],
                  [2, 3, 4, 5, 6, 7, 8]])
K = torch.tensor([[1, 2], [3, 4], [-1, -3]])
corr1d_multi_in(X, K)


tensor([ 2.,  8., 14., 20., 26., 32.])


Note that multi-input-channel one-dimensional cross-correlations are equivalent to
single-input-channel two-dimensional cross-correlations. To illustrate, an equivalent form of
the multi-input-channel one-dimensional cross-correlation in Fig. 16.3.3 is the
single-input-channel two-dimensional cross-correlation in Fig. 16.3.4, where the height of the
convolution kernel has to be the same as that of the input tensor.
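
This equivalence can be checked numerically. The following sketch uses plain Python lists so that it runs without any deep learning library; the function names are hypothetical, and the input and kernel values are chosen to reproduce the first output element computed in Fig. 16.3.3.

```python
def corr1d_list(x, k):
    """One-dimensional cross-correlation of two Python lists."""
    w = len(k)
    return [sum(x[i + j] * k[j] for j in range(w))
            for i in range(len(x) - w + 1)]

def corr1d_multi_in_list(X, K):
    """Sum the per-channel one-dimensional cross-correlations."""
    per_channel = [corr1d_list(x, k) for x, k in zip(X, K)]
    return [sum(vals) for vals in zip(*per_channel)]

def corr2d_list(X, K):
    """Two-dimensional cross-correlation with a single input channel."""
    h, w = len(K), len(K[0])
    return [[sum(X[i + a][j + b] * K[a][b]
                 for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]

X = [[0, 1, 2, 3, 4, 5, 6],
     [1, 2, 3, 4, 5, 6, 7],
     [2, 3, 4, 5, 6, 7, 8]]
K = [[1, 2], [3, 4], [-1, -3]]
print(corr1d_multi_in_list(X, K))  # channels treated as separate 1D inputs
print(corr2d_list(X, K))           # channels treated as rows of a 2D input
```

Because the kernel is as tall as the input, the two-dimensional window can only slide horizontally, so its single output row contains the same values as the multi-channel one-dimensional result.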


The outputs in both Fig. 16.3.2 and Fig. 16.3.3 have only one channel. As with
two-dimensional convolutions with multiple output channels described in Section 7.4.2, we
can also specify multiple output channels for one-dimensional convolutions.


Fig. 16.3.4 Two-dimensional cross-correlation operation with a single input channel. Output:
2, 8, 14, 20, 26, 32. The shaded portions are the first output element as well as the input and
kernel tensor elements used for the output computation:
2 × (−1) + 3 × (−3) + 1 × 3 + 2 × 4 + 0 × 1 + 1 × 2 = 2.


16.3.2 Max-Over-Time Pooling


Similarly, we can use pooling to extract the highest value from sequence representations as
the most important feature across time steps. The max-over-time pooling used in textCNN
works like the one-dimensional global max-pooling (Collobert et al., 2011). For a
multi-channel input where each channel stores values at different time steps, the output at
each channel is the maximum value for that channel. Note that the max-over-time pooling allows
different numbers of time steps at different channels.
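
A minimal sketch of max-over-time pooling on plain Python lists may help here. The function name and the toy values are hypothetical; the point is only that each channel is reduced to one scalar regardless of its length.

```python
def max_over_time(channels):
    """Return the maximum of each channel, whatever its number of time steps."""
    return [max(channel) for channel in channels]

# Three channels with 3, 2, and 4 time steps, respectively
channels = [[0.1, 0.7, 0.3], [0.9, 0.2], [0.4, 0.4, 0.8, 0.5]]
print(max_over_time(channels))
```

The result has one entry per channel, so outputs from convolutions with different kernel widths can be concatenated into a single vector.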


16.3.3 The textCNN Model 


Using the one-dimensional convolution and max-over-time pooling, the textCNN model
takes individual pretrained token representations as input, then obtains and transforms
sequence representations for the downstream application.


For a single text sequence with n tokens represented by d-dimensional vectors, the width, 
height, and number of channels of the input tensor are n, 1, and d, respectively. The 
textCNN model transforms the input into the output as follows: 


1. Define multiple one-dimensional convolution kernels and perform convolution
operations separately on the inputs. Convolution kernels with different widths may capture
local features among different numbers of adjacent tokens.


2. Perform max-over-time pooling on all the output channels, and then concatenate all the 
scalar pooling outputs as a vector. 


3. Transform the concatenated vector into the output categories using the fully connected 
layer. Dropout can be used for reducing overfitting. 


Fig. 16.3.5 illustrates the model architecture of textCNN with a concrete example. The input
is a sentence with 11 tokens, where each token is represented by a 6-dimensional vector.
So we have a 6-channel input with width 11. Define two one-dimensional convolution
kernels of widths 2 and 4, with 4 and 5 output channels, respectively. They produce 4 output
channels with width 11 − 2 + 1 = 10 and 5 output channels with width 11 − 4 + 1 = 8. Despite
different widths of these 9 channels, the max-over-time pooling gives a concatenated
9-dimensional vector, which is finally transformed into a 2-dimensional output vector for
binary sentiment predictions.
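
The shape arithmetic in this example can be written out as a short sketch. The variable names are chosen for illustration only, and the sketch assumes no padding and a stride of 1.

```python
num_tokens = 11                    # width of the input sentence
kernel_widths = [2, 4]             # widths of the two convolution kernels
out_channels = [4, 5]              # output channels of each kernel

# Without padding and with stride 1, each output has width n - k + 1
conv_widths = [num_tokens - k + 1 for k in kernel_widths]
# Max-over-time pooling keeps one scalar per output channel
pooled_dim = sum(out_channels)
print(conv_widths, pooled_dim)
```

The widths 10 and 8 differ, yet pooling reduces every channel to one value, so the concatenated vector has 4 + 5 = 9 entries.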


Fig. 16.3.5 The model architecture of textCNN.


Defining the Model


We implement the textCNN model in the following class. Compared with the bidirectional
RNN model in Section 16.2, besides replacing recurrent layers with convolutional layers,
we also use two embedding layers: one with trainable weights and the other with fixed
weights.


class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_size, kernel_sizes, num_channels,
                 **kwargs):
        super(TextCNN, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # The embedding layer not to be trained
        self.constant_embedding = nn.Embedding(vocab_size, embed_size)
        self.dropout = nn.Dropout(0.5)
        self.decoder = nn.Linear(sum(num_channels), 2)
        # The pooling layer has no parameters, so this instance can be
        # shared. Note that `nn.AdaptiveAvgPool1d(1)` averages over time
        # steps; `nn.AdaptiveMaxPool1d(1)` would take the maximum over time
        # steps as described in the text
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.relu = nn.ReLU()
        # Create multiple one-dimensional convolutional layers
        self.convs = nn.ModuleList()
        for c, k in zip(num_channels, kernel_sizes):
            self.convs.append(nn.Conv1d(2 * embed_size, c, k))

    def forward(self, inputs):
        # Concatenate two embedding layer outputs with shape (batch size, no.
        # of tokens, token vector dimension) along vectors
        embeddings = torch.cat((
            self.embedding(inputs), self.constant_embedding(inputs)), dim=2)
        # Per the input format of one-dimensional convolutional layers,
        # rearrange the tensor so that the second dimension stores channels
        embeddings = embeddings.permute(0, 2, 1)
        # For each one-dimensional convolutional layer, after pooling over
        # time, a tensor of shape (batch size, no. of channels, 1) is
        # obtained. Remove the last dimension and concatenate along channels
        encoding = torch.cat([
            torch.squeeze(self.relu(self.pool(conv(embeddings))), dim=-1)
            for conv in self.convs], dim=1)
        outputs = self.decoder(self.dropout(encoding))
        return outputs


Let’s create a textCNN instance. It has 3 convolutional layers with kernel widths of 3, 4, 
and 5, all with 100 output channels. 


embed_size, kernel_sizes, nums_channels = 100, [3, 4, 5], [100, 100, 100]
devices = d2l.try_all_gpus()
net = TextCNN(len(vocab), embed_size, kernel_sizes, nums_channels)

def init_weights(module):
    if type(module) in (nn.Linear, nn.Conv1d):
        nn.init.xavier_uniform_(module.weight)

net.apply(init_weights);


Loading Pretrained Word Vectors 


As in Section 16.2, we load pretrained 100-dimensional GloVe embeddings as the
initialized token representations. These token representations (embedding weights) will be
trained in embedding and fixed in constant_embedding.


glove_embedding = d2l.TokenEmbedding('glove.6b.100d')
embeds = glove_embedding[vocab.idx_to_token]
net.embedding.weight.data.copy_(embeds)
net.constant_embedding.weight.data.copy_(embeds)
net.constant_embedding.weight.requires_grad = False


Training and Evaluating the Model 


Now we can train the textCNN model for sentiment analysis. 


lr, num_epochs = 0.001, 5
trainer = torch.optim.Adam(net.parameters(), lr=lr)
loss = nn.CrossEntropyLoss(reduction="none")
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)


loss 0.066, train acc 0.979, test acc 0.868
4354.2 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]


Below we use the trained model to predict the sentiment for two simple sentences. 


d2l.predict_sentiment(net, vocab, 'this movie is so great')


'positive'


d2l.predict_sentiment(net, vocab, 'this movie is so bad')


'negative'


16.3.4 Summary 


• One-dimensional CNNs can process local features such as n-grams in text.


• Multi-input-channel one-dimensional cross-correlations are equivalent to
  single-input-channel two-dimensional cross-correlations.


• The max-over-time pooling allows different numbers of time steps at different channels.


• The textCNN model transforms individual token representations into downstream
  application outputs using one-dimensional convolutional layers and max-over-time pooling
  layers.


16.3.5 Exercises 


1. Tune hyperparameters and compare the two architectures for sentiment analysis in
Section 16.2 and in this section, such as in classification accuracy and computational
efficiency.


2. Can you further improve the classification accuracy of the model by using the methods 
introduced in the exercises of Section 16.2? 


3. Add positional encoding in the input representations. Does it improve the classification 
accuracy? 


Discussions.


16.4 Natural Language Inference and the Dataset


In Section 16.1, we discussed the problem of sentiment analysis. This task aims to
classify a single text sequence into predefined categories, such as a set of sentiment polarities.
However, when there is a need to decide whether one sentence can be inferred from
another, or eliminate redundancy by identifying sentences that are semantically equivalent,
knowing how to classify one text sequence is insufficient. Instead, we need to be able to
reason over pairs of text sequences.


16.4.1 Natural Language Inference 


Natural language inference studies whether a hypothesis can be inferred from a premise, 
where both are a text sequence. In other words, natural language inference determines the 
logical relationship between a pair of text sequences. Such relationships usually fall into 
three types: 


• Entailment: the hypothesis can be inferred from the premise.
• Contradiction: the negation of the hypothesis can be inferred from the premise.
• Neutral: all the other cases.


Natural language inference is also known as the recognizing textual entailment task. For
example, the following pair will be labeled as entailment because “showing affection” in
the hypothesis can be inferred from “hugging each other” in the premise.


Premise: Two women are hugging each other. 


Hypothesis: Two women are showing affection. 


The following is an example of contradiction as “running the coding example” indicates 
“not sleeping” rather than “sleeping”. 


Premise: A man is running the coding example from Dive into Deep Learning. 


Hypothesis: The man is sleeping. 


The third example shows a neutrality relationship because neither “famous” nor “not
famous” can be inferred from the fact that “are performing for us”.


Premise: The musicians are performing for us. 


Hypothesis: The musicians are famous. 


Natural language inference has been a central topic for understanding natural language.
It enjoys wide applications ranging from information retrieval to open-domain question
answering. To study this problem, we will begin by investigating a popular natural language
inference benchmark dataset.


16.4.2 The Stanford Natural Language Inference (SNLI) Dataset


The Stanford Natural Language Inference (SNLI) Corpus is a collection of over 500,000 labeled
English sentence pairs (Bowman et al., 2015). We download and store the extracted SNLI
dataset in the path ../data/snli_1.0.


import os
import re
import torch
from torch import nn
from d2l import torch as d2l

#@save
d2l.DATA_HUB['SNLI'] = (
    'https://nlp.stanford.edu/projects/snli/snli_1.0.zip',
    '9fcde07509c7e87ec61c640c1b2753d9041758e4')

data_dir = d2l.download_extract('SNLI')


Reading the Dataset 


The original SNLI dataset contains much richer information than what we really need in
our experiments. Thus, we define a function read_snli to only extract part of the dataset,
then return lists of premises, hypotheses, and their labels.


#@save
def read_snli(data_dir, is_train):
    """Read the SNLI dataset into premises, hypotheses, and labels."""
    def extract_text(s):
        # Remove information that will not be used by us
        s = re.sub('\\(', '', s)
        s = re.sub('\\)', '', s)
        # Substitute two or more consecutive whitespace with space
        s = re.sub('\\s{2,}', ' ', s)
        return s.strip()
    label_set = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
    file_name = os.path.join(data_dir, 'snli_1.0_train.txt'
                             if is_train else 'snli_1.0_test.txt')
    with open(file_name, 'r') as f:
        rows = [row.split('\t') for row in f.readlines()[1:]]
    premises = [extract_text(row[1]) for row in rows if row[0] in label_set]
    hypotheses = [extract_text(row[2]) for row in rows if row[0] in label_set]
    labels = [label_set[row[0]] for row in rows if row[0] in label_set]
    return premises, hypotheses, labels
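
To illustrate the cleaning and filtering steps on their own, the sketch below applies a helper in the spirit of extract_text to two made-up rows. The row contents, the `-` placeholder for a pair without an agreed label, and the name `extract_text_sketch` are assumptions for illustration.

```python
import re

def extract_text_sketch(s):
    # Drop the parentheses of the bracketed parse, then squeeze whitespace
    s = re.sub(r'[()]', '', s)
    s = re.sub(r'\s{2,}', ' ', s)
    return s.strip()

label_set = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
# Hypothetical rows: (label, premise parse, hypothesis parse)
rows = [
    ['entailment', '( ( A person ) ( is outdoors ) )',
     '( A person ( is outside ) )'],
    ['-', '( A dog ( runs ) )', '( A cat ( sleeps ) )'],
]
# Rows whose label is not one of the three classes are skipped
kept = [(extract_text_sketch(r[1]), extract_text_sketch(r[2]), label_set[r[0]])
        for r in rows if r[0] in label_set]
print(kept)
```

Only the first row survives, with its parentheses removed and its label mapped to the integer 0.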


Now let’s print the first 3 pairs of premise and hypothesis, as well as their labels (“0”, “1”,
and “2” correspond to “entailment”, “contradiction”, and “neutral”, respectively).


train_data = read_snli(data_dir, is_train=True)
for x0, x1, y in zip(train_data[0][:3], train_data[1][:3], train_data[2][:3]):
    print('premise:', x0)
    print('hypothesis:', x1)
    print('label:', y)


premise: A person on a horse jumps over a broken down airplane . 
hypothesis: A person is training his horse for a competition . 
label: 2 

premise: A person on a horse jumps over a broken down airplane . 
hypothesis: A person is at a diner , ordering an omelette . 
label: 1 

premise: A person on a horse jumps over a broken down airplane . 
hypothesis: A person is outdoors , on a horse . 

label: 0


The training set has about 550000 pairs, and the testing set has about 10000 pairs. The
following shows that the three labels “entailment”, “contradiction”, and “neutral” are
approximately balanced in both the training set and the testing set.


test_data = read_snli(data_dir, is_train=False)
for data in [train_data, test_data]:
    print([[row for row in data[2]].count(i) for i in range(3)])


[183416, 183187, 182764] 
[3368, 3237, 3219] 


Defining a Class for Loading the Dataset 


Below we define a class for loading the SNLI dataset by inheriting from the Dataset class
in PyTorch (torch.utils.data.Dataset). The argument num_steps in the class constructor
specifies the length of a text sequence so that each minibatch of sequences will have the same
shape. In other words, tokens after the first num_steps ones in longer sequences are trimmed,
while special tokens “<pad>” will be appended to shorter sequences until their length becomes
num_steps. By implementing the __getitem__ function, we can arbitrarily access the premise,
hypothesis, and label with the index idx.


#@save
class SNLIDataset(torch.utils.data.Dataset):
    """A customized dataset to load the SNLI dataset."""
    def __init__(self, dataset, num_steps, vocab=None):
        self.num_steps = num_steps
        all_premise_tokens = d2l.tokenize(dataset[0])
        all_hypothesis_tokens = d2l.tokenize(dataset[1])
        if vocab is None:
            self.vocab = d2l.Vocab(all_premise_tokens + all_hypothesis_tokens,
                                   min_freq=5, reserved_tokens=['<pad>'])
        else:
            self.vocab = vocab
        self.premises = self._pad(all_premise_tokens)
        self.hypotheses = self._pad(all_hypothesis_tokens)
        self.labels = torch.tensor(dataset[2])
        print('read ' + str(len(self.premises)) + ' examples')

    def _pad(self, lines):
        return torch.tensor([d2l.truncate_pad(
            self.vocab[line], self.num_steps, self.vocab['<pad>'])
                             for line in lines])

    def __getitem__(self, idx):
        return (self.premises[idx], self.hypotheses[idx]), self.labels[idx]

    def __len__(self):
        return len(self.premises)


Putting It All Together 


Now we can invoke the read_snli function and the SNLIDataset class to download the
SNLI dataset and return DataLoader instances for both training and testing sets, together
with the vocabulary of the training set. It is noteworthy that we must use the vocabulary
constructed from the training set as that of the testing set. As a result, any new token from
the testing set will be unknown to the model trained on the training set.


#@save
def load_data_snli(batch_size, num_steps=50):
    """Download the SNLI dataset and return data iterators and vocabulary."""
    num_workers = d2l.get_dataloader_workers()
    data_dir = d2l.download_extract('SNLI')
    train_data = read_snli(data_dir, True)
    test_data = read_snli(data_dir, False)
    train_set = SNLIDataset(train_data, num_steps)
    test_set = SNLIDataset(test_data, num_steps, train_set.vocab)
    train_iter = torch.utils.data.DataLoader(train_set, batch_size,
                                             shuffle=True,
                                             num_workers=num_workers)
    test_iter = torch.utils.data.DataLoader(test_set, batch_size,
                                            shuffle=False,
                                            num_workers=num_workers)
    return train_iter, test_iter, train_set.vocab


Here we set the batch size to 128 and sequence length to 50, and invoke the load_data_snli
function to get the data iterators and vocabulary. Then we print the vocabulary size.


train_iter, test_iter, vocab = load_data_snli(128, 50) 
len(vocab) 


read 549367 examples
read 9824 examples


Now we print the shape of the first minibatch. Contrary to sentiment analysis, we have two
inputs X[0] and X[1] representing pairs of premises and hypotheses.


for X, Y in train_iter:
    print(X[0].shape)
    print(X[1].shape)
    print(Y.shape)
    break


torch.Size([128, 50])
torch.Size([128, 50])
torch.Size([128])


16.4.3 Summary 


• Natural language inference studies whether a hypothesis can be inferred from a premise,
  where both are a text sequence.


• In natural language inference, relationships between premises and hypotheses include
  entailment, contradiction, and neutral.


• Stanford Natural Language Inference (SNLI) Corpus is a popular benchmark dataset of
  natural language inference.


16.4.4 Exercises 


1. Machine translation has long been evaluated based on superficial n-gram matching between
an output translation and a ground-truth translation. Can you design a measure for
evaluating machine translation results by using natural language inference?


2. How can we change hyperparameters to reduce the vocabulary size?


16.5 Natural Language Inference: Using Attention


We introduced the natural language inference task and the SNLI dataset in Section 16.4. In
view of many models that are based on complex and deep architectures, Parikh et al. (2016)
proposed to address natural language inference with attention mechanisms and called it a
“decomposable attention model”. This results in a model without recurrent or convolutional
layers, achieving the best result at the time on the SNLI dataset with far fewer parameters.
In this section, we will describe and implement this attention-based method (with MLPs)
for natural language inference, as depicted in Fig. 16.5.1.


16.5.1 The Model 


Rather than preserving the order of tokens in premises and hypotheses, we can simply align
tokens in one text sequence to every token in the other, and vice versa, then compare and
aggregate such information to predict the logical relationships between premises and
hypotheses. Similar to alignment of tokens between source and target sentences in machine
translation, the alignment of tokens between premises and hypotheses can be neatly
accomplished by attention mechanisms.


Fig. 16.5.2 Natural language inference using attention mechanisms.


Fig. 16.5.2 depicts the natural language inference method using attention mechanisms. At a
high level, it consists of three jointly trained steps: attending, comparing, and aggregating.
We will illustrate them step by step in the following.


import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


Attending


The first step is to align tokens in one text sequence to each token in the other sequence.
Suppose that the premise is “i do need sleep” and the hypothesis is “i am tired”. Due to
semantic similarity, we may wish to align “i” in the hypothesis with “i” in the premise,
and align “tired” in the hypothesis with “sleep” in the premise. Likewise, we may wish
to align “i” in the premise with “i” in the hypothesis, and align “need” and “sleep” in the
premise with “tired” in the hypothesis. Note that such alignment is soft, using a weighted
average, where ideally large weights are associated with the tokens to be aligned. For ease
of demonstration, Fig. 16.5.2 shows such alignment in a hard way.


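Now we describe the soft alignment using attention mechanisms in more detail. Denote
by $\mathbf{A} = (\mathbf{a}_1, \ldots, \mathbf{a}_m)$ and $\mathbf{B} = (\mathbf{b}_1, \ldots, \mathbf{b}_n)$ the premise and hypothesis, whose numbers
of tokens are $m$ and $n$, respectively, where $\mathbf{a}_i, \mathbf{b}_j \in \mathbb{R}^{d}$ ($i = 1, \ldots, m$, $j = 1, \ldots, n$) is a
$d$-dimensional word vector. For soft alignment, we compute the attention weights $e_{ij} \in \mathbb{R}$
as

$$e_{ij} = f(\mathbf{a}_i)^\top f(\mathbf{b}_j), \tag{16.5.1}$$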


where the function f is an MLP defined in the following mlp function. The output dimen- 
sion of f is specified by the num_hiddens argument of mlp. 


def mlp(num_inputs, num_hiddens, flatten):
    net = []
    net.append(nn.Dropout(0.2))
    net.append(nn.Linear(num_inputs, num_hiddens))
    net.append(nn.ReLU())
    if flatten:
        net.append(nn.Flatten(start_dim=1))
    net.append(nn.Dropout(0.2))
    net.append(nn.Linear(num_hiddens, num_hiddens))
    net.append(nn.ReLU())
    if flatten:
        net.append(nn.Flatten(start_dim=1))
    # Unpack the list so that each layer becomes one stage of the pipeline
    return nn.Sequential(*net)


It should be highlighted that, in (16.5.1), $f$ takes inputs $\mathbf{a}_i$ and $\mathbf{b}_j$ separately rather than
taking a pair of them together as input. This decomposition trick leads to only $m + n$
applications (linear complexity) of $f$ rather than $mn$ applications (quadratic complexity).
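To make the counting argument concrete, the following sketch tallies function calls under both schemes for a toy premise with m = 4 tokens and a toy hypothesis with n = 3 tokens. It is written in plain Python rather than PyTorch so that the bookkeeping stays visible. The names `f`, `f_pair`, and `dot`, the doubling map, and the toy vectors are all illustrative assumptions and are not part of the model in this section.

```python
# Illustrative only: `f` and `f_pair` are toy stand-ins, not the MLP above.
calls = {"f": 0, "f_pair": 0}


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def f(x):
    """Toy per-token map (doubles each component); counts its own calls."""
    calls["f"] += 1
    return [2.0 * v for v in x]


def f_pair(a, b):
    """Toy pairwise scorer that must be evaluated once per token pair."""
    calls["f_pair"] += 1
    return dot([2.0 * v for v in a], [2.0 * v for v in b])


premise = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # m = 4 tokens
hypothesis = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]           # n = 3 tokens

# Decomposed form of (16.5.1): apply f once per token, then take dot products
f_A = [f(a) for a in premise]
f_B = [f(b) for b in hypothesis]
e = [[dot(fa, fb) for fb in f_B] for fa in f_A]

# Pairwise form: one call for each of the m * n token pairs
e_pairwise = [[f_pair(a, b) for b in hypothesis] for a in premise]

print(calls)            # f runs m + n times, f_pair runs m * n times
print(e == e_pairwise)  # both routes give the same score matrix here
```

The dot products still cover all m × n pairs; what the decomposition saves is the number of passes through the comparatively expensive function.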


Normalizing the attention weights in (16.5.1), we compute the weighted average of all
the token vectors in the hypothesis to obtain a representation of the hypothesis that is softly
aligned with the token indexed by $i$ in the premise:

$$\boldsymbol{\beta}_i = \sum_{j=1}^{n} \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})} \mathbf{b}_j. \tag{16.5.2}$$

Likewise, we compute soft alignment of premise tokens for each token indexed by $j$ in the
hypothesis:

$$\boldsymbol{\alpha}_j = \sum_{i=1}^{m} \frac{\exp(e_{ij})}{\sum_{k=1}^{m} \exp(e_{kj})} \mathbf{a}_i. \tag{16.5.3}$$
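As a small worked instance of (16.5.2) and (16.5.3), the sketch below computes both soft alignments by hand for m = 2 premise tokens and n = 3 hypothesis tokens. The score matrix `e` is made up for illustration (it is not produced by any trained f), and the helper names `softmax` and `weighted_average` are assumptions of this sketch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def weighted_average(weights, vectors):
    """Return sum_k weights[k] * vectors[k] as a list."""
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors))
            for k in range(dim)]


A = [[1.0, 0.0], [0.0, 1.0]]              # premise tokens a_1, a_2
B = [[1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]  # hypothesis tokens b_1, b_2, b_3
# Hypothetical attention scores e[i][j] between a_i and b_j
e = [[0.0, 0.0, 0.0],
     [0.0, math.log(2.0), 0.0]]

# (16.5.2): normalize each row of e over j, then average the b_j
beta = [weighted_average(softmax(row), B) for row in e]
# (16.5.3): normalize each column of e over i, then average the a_i
columns = [[e[i][j] for i in range(len(A))] for j in range(len(B))]
alpha = [weighted_average(softmax(col), A) for col in columns]

print(beta)   # one vector per premise token
print(alpha)  # one vector per hypothesis token
```

In the first row all scores are equal, so the aligned representation is the plain mean of the hypothesis vectors; in the second row the middle score is larger and pulls the average toward the second hypothesis token.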


Below we define the Attend class to compute the soft alignment of hypotheses (beta) with 
input premises A and soft alignment of premises (alpha) with input hypotheses B. 


class Attend(nn.Module):
    def __init__(self, num_inputs, num_hiddens, **kwargs):
        super(Attend, self).__init__(**kwargs)
        self.f = mlp(num_inputs, num_hiddens, flatten=False)

    def forward(self, A, B):
        # Shape of `A`/`B`: (`batch_size`, no. of tokens in sequence A/B,
        # `embed_size`)
        # Shape of `f_A`/`f_B`: (`batch_size`, no. of tokens in sequence A/B,
        # `num_hiddens`)
        f_A = self.f(A)
        f_B = self.f(B)
        # Shape of `e`: (`batch_size`, no. of tokens in sequence A,
        # no. of tokens in sequence B)
        e = torch.bmm(f_A, f_B.permute(0, 2, 1))
        # Shape of `beta`: (`batch_size`, no. of tokens in sequence A,
        # `embed_size`), where sequence B is softly aligned with each token
        # (axis 1 of `beta`) in sequence A
        beta = torch.bmm(F.softmax(e, dim=-1), B)
        # Shape of `alpha`: (`batch_size`, no. of tokens in sequence B,
        # `embed_size`), where sequence A is softly aligned with each token
        # (axis 1 of `alpha`) in sequence B
        alpha = torch.bmm(F.softmax(e.permute(0, 2, 1), dim=-1), A)
        return beta, alpha


Comparing 


In the next step, we compare a token in one sequence with the other sequence that is softly
aligned with that token. Note that in soft alignment, all the tokens from one sequence,
though with probably different attention weights, will be compared with a token in the
other sequence. For ease of demonstration, Fig. 16.5.2 pairs tokens with aligned tokens
in a hard way. For example, suppose that the attending step determines that “need” and
“sleep” in the premise are both aligned with “tired” in the hypothesis; then the pair
“tired–need sleep” will be compared.


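In the comparing step, we feed the concatenation (operator $[\cdot, \cdot]$) of tokens from one
sequence and aligned tokens from the other sequence into a function $g$ (an MLP):

$$\begin{aligned} \mathbf{v}_{A,i} &= g([\mathbf{a}_i, \boldsymbol{\beta}_i]), \quad i = 1, \ldots, m \\ \mathbf{v}_{B,j} &= g([\mathbf{b}_j, \boldsymbol{\alpha}_j]), \quad j = 1, \ldots, n. \end{aligned} \tag{16.5.4}$$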


In (16.5.4), $\mathbf{v}_{A,i}$ is the comparison between token $i$ in the premise and all the hypothesis
tokens that are softly aligned with token $i$; while $\mathbf{v}_{B,j}$ is the comparison between token $j$ in
the hypothesis and all the premise tokens that are softly aligned with token $j$. The following
Compare class defines such a comparing step.


class Compare(nn.Module):
    def __init__(self, num_inputs, num_hiddens, **kwargs):
        super(Compare, self).__init__(**kwargs)
        self.g = mlp(num_inputs, num_hiddens, flatten=False)

    def forward(self, A, B, beta, alpha):
        V_A = self.g(torch.cat([A, beta], dim=2))
        V_B = self.g(torch.cat([B, alpha], dim=2))
        return V_A, V_B


Aggregating 


With two sets of comparison vectors $\mathbf{v}_{A,i}$ ($i = 1, \ldots, m$) and $\mathbf{v}_{B,j}$ ($j = 1, \ldots, n$) at hand,
in the last step we will aggregate such information to infer the logical relationship. We
begin by summing up both sets:

$$\mathbf{v}_A = \sum_{i=1}^{m} \mathbf{v}_{A,i}, \quad \mathbf{v}_B = \sum_{j=1}^{n} \mathbf{v}_{B,j}. \tag{16.5.5}$$

Next we feed the concatenation of both summarization results into function $h$ (an MLP) to
obtain the classification result of the logical relationship:

$$\hat{\mathbf{y}} = h([\mathbf{v}_A, \mathbf{v}_B]). \tag{16.5.6}$$

The aggregation step is defined in the following Aggregate class.


class Aggregate(nn.Module):
    def __init__(self, num_inputs, num_hiddens, num_outputs, **kwargs):
        super(Aggregate, self).__init__(**kwargs)
        self.h = mlp(num_inputs, num_hiddens, flatten=True)
        self.linear = nn.Linear(num_hiddens, num_outputs)

    def forward(self, V_A, V_B):
        # Sum up both sets of comparison vectors
        V_A = V_A.sum(dim=1)
        V_B = V_B.sum(dim=1)
        # Feed the concatenation of both summarization results into an MLP
        Y_hat = self.linear(self.h(torch.cat([V_A, V_B], dim=1)))
        return Y_hat
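The sum-then-concatenate step can be traced by hand on small lists. The sketch below uses plain Python and made-up comparison vectors; `sum_over_tokens` and `aggregate_input` are hypothetical helper names. It illustrates two consequences of (16.5.5): the input to h has a fixed width regardless of m and n, and it does not depend on the order of the tokens.

```python
def sum_over_tokens(vectors):
    """Component-wise sum of equal-length vectors, as in (16.5.5)."""
    return [sum(column) for column in zip(*vectors)]


def aggregate_input(V_A, V_B):
    """Concatenate the two sums, i.e., build the input [v_A, v_B] of h."""
    return sum_over_tokens(V_A) + sum_over_tokens(V_B)


# Made-up comparison vectors with 2 components each
V_A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # m = 4 tokens
V_B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]              # n = 3 tokens

x = aggregate_input(V_A, V_B)
x_reordered = aggregate_input(list(reversed(V_A)), V_B)

print(x)                 # width is 2 + 2, independent of m and n
print(x == x_reordered)  # reordering the premise tokens changes nothing
```

The order insensitivity is the flip side of the simplification mentioned at the start of Section 16.5.1: token order is not preserved by this model.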


Putting It All Together 


By putting the attending, comparing, and aggregating steps together, we define the decom- 
posable attention model to jointly train these three steps. 

class DecomposableAttention(nn.Module):
    def __init__(self, vocab, embed_size, num_hiddens, num_inputs_attend=100,
                 num_inputs_compare=200, num_inputs_agg=400, **kwargs):
        super(DecomposableAttention, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_size)
        self.attend = Attend(num_inputs_attend, num_hiddens)
        self.compare = Compare(num_inputs_compare, num_hiddens)
        # There are 3 possible outputs: entailment, contradiction, and neutral
        self.aggregate = Aggregate(num_inputs_agg, num_hiddens, num_outputs=3)

    def forward(self, X):
        premises, hypotheses = X
        A = self.embedding(premises)
        B = self.embedding(hypotheses)
        beta, alpha = self.attend(A, B)
        V_A, V_B = self.compare(A, B, beta, alpha)
        Y_hat = self.aggregate(V_A, V_B)
        return Y_hat


16.5.2 Training and Evaluating the Model 


Now we will train and evaluate the defined decomposable attention model on the SNLI 
dataset. We begin by reading the dataset. 


Reading the dataset 


We download and read the SNLI dataset using the function defined in Section 16.4. The 
batch size and sequence length are set to 256 and 50, respectively. 


batch_size, num_steps = 256, 50
train_iter, test_iter, vocab = d2l.load_data_snli(batch_size, num_steps)


Downloading ../data/snli_1.0.zip from https://nlp.stanford.edu/projects/snli/snli_1.0.zip...
read 549367 examples
read 9824 examples
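Fixing the sequence length means that every premise and hypothesis is represented by exactly `num_steps` token indices, so that sequences of different lengths can be stacked into one minibatch. A minimal sketch of that idea follows, assuming a hypothetical helper `truncate_or_pad` and a made-up padding index of 0; the actual preprocessing is done by the loading function from Section 16.4.

```python
def truncate_or_pad(token_ids, num_steps, pad_id):
    """Cut a sequence to `num_steps` items or right-pad it with `pad_id`."""
    if len(token_ids) >= num_steps:
        return token_ids[:num_steps]
    return token_ids + [pad_id] * (num_steps - len(token_ids))


# A short and a long sequence of hypothetical token indices
short = [7, 8, 9]
long = list(range(10, 18))

print(truncate_or_pad(short, num_steps=5, pad_id=0))  # padded on the right
print(truncate_or_pad(long, num_steps=5, pad_id=0))   # cut after 5 tokens
```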


Creating the Model 


We use the pretrained 100-dimensional GloVe embedding to represent the input tokens.
Thus, we predefine the dimension of vectors $\mathbf{a}_i$ and $\mathbf{b}_j$ in (16.5.1) as 100. The output
dimension of functions $f$ in (16.5.1) and $g$ in (16.5.4) is set to 200. Then we create a
model instance, initialize its parameters, and load the GloVe embedding to initialize vectors
of input tokens.


embed_size, num_hiddens, devices = 100, 200, d2l.try_all_gpus()
net = DecomposableAttention(vocab, embed_size, num_hiddens)
glove_embedding = d2l.TokenEmbedding('glove.6b.100d')
embeds = glove_embedding[vocab.idx_to_token]
net.embedding.weight.data.copy_(embeds);


Downloading ../data/glove.6B.100d.zip from http://d2l-data.s3-accelerate.amazonaws.com/glove.6B.100d.zip...
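The default input sizes 100, 200, and 400 in the DecomposableAttention constructor follow from these two choices. The sketch below spells out the arithmetic; the function name `decomposable_attention_input_sizes` is a hypothetical helper for illustration and is not part of the model code.

```python
def decomposable_attention_input_sizes(embed_size, num_hiddens):
    """Input widths of f, g, and h implied by the three steps."""
    return {
        "attend": embed_size,          # f sees one token vector a_i or b_j
        "compare": 2 * embed_size,     # g sees [a_i, beta_i] or [b_j, alpha_j]
        "aggregate": 2 * num_hiddens,  # h sees [v_A, v_B]
    }


sizes = decomposable_attention_input_sizes(embed_size=100, num_hiddens=200)
print(sizes)
```

If a different embedding dimension or hidden size is used, the three `num_inputs_*` arguments would need to change accordingly.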


Training and Evaluating the Model 


In contrast to the split_batch function in Section 13.5 that takes single inputs such as text
sequences (or images), we define a split_batch_multi_inputs function to take multiple
inputs such as premises and hypotheses in minibatches.


Now we can train and evaluate the model on the SNLI dataset. 


lr, num_epochs = 0.001, 4
trainer = torch.optim.Adam(net.parameters(), lr=lr)
loss = nn.CrossEntropyLoss(reduction="none")
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)


loss 0.496, train acc 0.805, test acc 0.828
20383.2 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]


[Figure: train loss, train acc, and test acc plotted against epoch.]


Using the Model 


Finally, define the prediction function to output the logical relationship between a pair of 
premise and hypothesis. 


#@save
def predict_snli(net, vocab, premise, hypothesis):
    """Predict the logical relationship between the premise and hypothesis."""
    net.eval()
    premise = torch.tensor(vocab[premise], device=d2l.try_gpu())
    hypothesis = torch.tensor(vocab[hypothesis], device=d2l.try_gpu())
    label = torch.argmax(net([premise.reshape((1, -1)),
                              hypothesis.reshape((1, -1))]), dim=1)
    return 'entailment' if label == 0 else 'contradiction' if label == 1 \
            else 'neutral'


We can use the trained model to obtain the natural language inference result for a sample 
pair of sentences. 


predict_snli(net, vocab, ['he', 'is', 'good'], ['he', 'is', 'bad'])


'contradiction'


16.5.3 Summary 


• The decomposable attention model consists of three steps for predicting the logical
  relationships between premises and hypotheses: attending, comparing, and aggregating.

• With attention mechanisms, we can align tokens in one text sequence to every token in
  the other, and vice versa. Such alignment is soft, using a weighted average, where ideally
  large weights are associated with the tokens to be aligned.

• The decomposition trick leads to a more desirable linear complexity than quadratic
  complexity when computing attention weights.

• We can use pretrained word vectors as the input representation for downstream natural
  language processing tasks such as natural language inference.


16.5.4 Exercises 


1. Train the model with other combinations of hyperparameters. Can you get better accu- 
racy on the test set? 


2. What are major drawbacks of the decomposable attention model for natural language
inference?


3. Suppose that we want to get the level of semantical similarity (e.g., a continuous value 
between 0 and 1) for any pair of sentences. How shall we collect and label the dataset? 
Can you design a model with attention mechanisms? 


Discussions


16.6 Fine-Tuning BERT for Sequence-Level and Token-Level Applications


In the previous sections of this chapter, we have designed different models for natural
language processing applications, such as those based on RNNs, CNNs, attention, and MLPs.
These models are helpful when there are space or time constraints; however, crafting a specific
model for every natural language processing task is practically infeasible. In Section 15.8, we
introduced a pretrained model, BERT, that requires minimal architecture changes for a wide
range of natural language processing tasks. On the one hand, at the time of its proposal,
BERT improved the state of the art on various natural language processing tasks. On the
other hand, as noted in Section 15.10, the two versions of the original BERT model come
with 110 million and 340 million parameters. Thus, when there are sufficient computational
resources, we may consider fine-tuning BERT for downstream natural language processing
applications.


In the following, we generalize a subset of natural language processing applications as
sequence-level and token-level. On the sequence level, we introduce how to transform the
BERT representation of the text input to the output label in single text classification and
text pair classification or regression. On the token level, we will briefly introduce new
applications such as text tagging and question answering and shed light on how BERT can
represent their inputs and how these representations are transformed into output labels.
During fine-tuning, the “minimal architecture changes” required by BERT across different
applications are the extra fully connected layers. During supervised learning of a downstream
application, parameters of the extra layers are learned from scratch while all the parameters
in the pretrained BERT model are fine-tuned.


16.6.1 Single Text Classification 


Single text classification takes a single text sequence as input and outputs its classification 
result. Besides sentiment analysis that we have studied in this chapter, the Corpus of Lin- 
guistic Acceptability (CoLA) is also a dataset for single text classification, judging whether 
a given sentence is grammatically acceptable or not (Warstadt et al., 2019). For instance, 
“I should study.” is acceptable but “I should studying.” is not.


Section 15.8 describes the input representation of BERT. The BERT input sequence
unambiguously represents both single text and text pairs, where the special classification token
“<cls>” is used for sequence classification and the special separation token “<sep>”
marks the end of single text or separates a pair of text. As shown in Fig. 16.6.1, in single
text classification applications, the BERT representation of the special classification token
“<cls>” encodes the information of the entire input text sequence. As the representation of
the input single text, it will be fed into a small MLP consisting of fully connected (dense)
layers to output the distribution of all the discrete label values.
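The packing of one or two texts into a single BERT input sequence can be sketched in a few lines of plain Python. The function name `pack_bert_input` is an assumption of this sketch; it follows the description above, with segment ID 0 for “<cls>”, the first text, and its “<sep>”, and segment ID 1 for the second text and its “<sep>”.

```python
def pack_bert_input(tokens_a, tokens_b=None):
    """Build the token list and segment IDs of a BERT input sequence."""
    tokens = ['<cls>'] + tokens_a + ['<sep>']
    # Segment 0 covers '<cls>', the first text, and the first '<sep>'
    segments = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        tokens += tokens_b + ['<sep>']
        # Segment 1 covers the second text and the closing '<sep>'
        segments += [1] * (len(tokens_b) + 1)
    return tokens, segments


single = pack_bert_input(['i', 'should', 'study'])
pair = pack_bert_input(['i', 'do'], ['i', 'am', 'tired'])

print(single)
print(pair)
```

For single text classification only the first form is needed; the pair form is the input used for the text pair tasks in this chapter.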


Fig. 16.6.1 Fine-tuning BERT for single text classification applications, such as sentiment analysis
and testing linguistic acceptability. Suppose that the input single text has six tokens.


16.6.2 Text Pair Classification or Regression


We have also examined natural language inference in this chapter. It belongs to text pair 
classification, a type of application classifying a pair of text. 


Taking a pair of text as input but outputting a continuous value, semantic textual similarity 
is a popular text pair regression task. This task measures semantic similarity of sentences. 
For instance, in the Semantic Textual Similarity Benchmark dataset, the similarity score of 
a pair of sentences is an ordinal scale ranging from 0 (no meaning overlap) to 5 (meaning 
equivalence) (Cer et al., 2017). The goal is to predict these scores. Examples from the 
Semantic Textual Similarity Benchmark dataset include (sentence 1, sentence 2, similarity 
score): 


• “A plane is taking off.”, “An air plane is taking off.”, 5.000;
• “A woman is eating something.”, “A woman is eating meat.”, 3.000;
• “A woman is dancing.”, “A man is talking.”, 0.000.


Fig. 16.6.2 Fine-tuning BERT for text pair classification or regression applications, such as natural
language inference and semantic textual similarity. Suppose that the input text pair has
two and three tokens.


Compared with single text classification in Fig. 16.6.1, fine-tuning BERT for text pair
classification in Fig. 16.6.2 is different in the input representation. For text pair regression
tasks such as semantic textual similarity, trivial changes can be applied, such as outputting
a continuous label value and using the mean squared loss: both are common for regression.
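As a small numeric illustration of the regression setting, the sketch below evaluates the mean squared loss on the three example pairs listed above. The predicted scores are invented for illustration and do not come from any model, and `mean_squared_loss` is a hypothetical helper name.

```python
def mean_squared_loss(predictions, targets):
    """Average of squared differences between predictions and targets."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Similarity scores of the three example pairs listed above
targets = [5.0, 3.0, 0.0]
# Hypothetical continuous outputs of a fine-tuned regression head
predictions = [4.5, 3.5, 1.0]

loss = mean_squared_loss(predictions, targets)
print(loss)  # (0.25 + 0.25 + 1.0) / 3
```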


16.6.3 Text Tagging 


Now let’s consider token-level tasks, such as text tagging, where each token is assigned a
label. Among text tagging tasks, part-of-speech tagging assigns each word a part-of-speech
tag (e.g., adjective and determiner) according to the role of the word in the sentence. For
example, according to the Penn Treebank II tag set, the sentence “John Smith ’s car is
new” should be tagged as “NNP (noun, proper singular) NNP POS (possessive ending)
NN (noun, singular or mass) VBZ (verb, third person singular present) JJ (adjective)”.


Fig. 16.6.3 Fine-tuning BERT for text tagging applications, such as part-of-speech tagging. Suppose
that the input single text has six tokens. The parameters of the dense layers are shared across tokens.


Fine-tuning BERT for text tagging applications is illustrated in Fig. 16.6.3. Compared
with Fig. 16.6.1, the only distinction is that in text tagging, the BERT representation
of every token of the input text is fed into the same extra fully connected layers to output
the label of the token, such as a part-of-speech tag.
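The idea of sharing one dense layer across all token positions can be sketched without any deep learning library. Everything below is a toy assumption: the two-dimensional “representations”, the weights `W` and `b`, and the three-tag set are invented for illustration, and real BERT representations are much wider.

```python
def dense(x, W, b):
    """One fully connected layer: returns W x + b as a list."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)


tags = ['NNP', 'NN', 'JJ']
# One shared set of parameters, reused at every token position
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]

# Hypothetical per-token representations (one row per token)
token_reps = [[2.0, 0.5], [0.1, 3.0], [-1.0, -2.0]]

predicted = [tags[argmax(dense(rep, W, b))] for rep in token_reps]
print(predicted)  # one tag per token
```

Because `W` and `b` are the same for every row, the number of extra parameters does not grow with the length of the input text.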


16.6.4 Question Answering 


As another token-level application, question answering reflects capabilities of reading
comprehension. For example, the Stanford Question Answering Dataset (SQuAD v1.1) consists
of reading passages and questions, where the answer to every question is just a segment of
text (text span) from the passage that the question is about (Rajpurkar et al., 2016). To
explain, consider a passage “Some experts report that a mask’s efficacy is inconclusive.
However, mask makers insist that their products, such as N95 respirator masks, can guard
against the virus.” and a question “Who says that N95 respirator masks can guard against
the virus?”. The answer should be the text span “mask makers” in the passage. Thus, the
goal in SQuAD v1.1 is to predict the start and end of the text span in the passage given a
pair of question and passage.


To fine-tune BERT for question answering, the question and passage are packed as the first
and second text sequence, respectively, in the input of BERT. To predict the position of the
start of the text span, the same additional fully connected layer will transform the BERT
representation of any token from the passage of position $i$ into a scalar score $s_i$. Such scores
of all the passage tokens are further transformed by the softmax operation into a probability
distribution, so that each token position $i$ in the passage is assigned a probability $p_i$ of being
the start of the text span. Predicting the end of the text span is the same as above, except that
parameters in its additional fully connected layer are independent from those for predicting
the start. When predicting the end, any passage token of position $i$ is transformed by the
same fully connected layer into a scalar score $e_i$. Fig. 16.6.4 depicts fine-tuning BERT for
question answering.


Fig. 16.6.4 Fine-tuning BERT for question answering. Suppose that the input text pair has two and
three tokens.


For question answering, the training objective of supervised learning is as straightforward as
maximizing the log-likelihoods of the ground-truth start and end positions. When predicting
the span, we can compute the score $s_i + e_j$ for a valid span from position $i$ to position
$j$ ($i \leq j$), and output the span with the highest score.
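The span selection rule can be written as a short exhaustive search. The sketch below is plain Python with invented scores; `best_span` is a hypothetical helper, and a span counts as valid whenever its end position is not before its start position. The example is chosen so that the largest start score and the largest end score cannot be combined, because that end lies before that start.

```python
def best_span(start_scores, end_scores):
    """Return (i, j) with i <= j that maximizes start_scores[i] + end_scores[j]."""
    best = None
    for i, s in enumerate(start_scores):
        for j in range(i, len(end_scores)):
            score = s + end_scores[j]
            if best is None or score > best[0]:
                best = (score, i, j)
    return best[1], best[2]


# Hypothetical scores for a passage of four tokens
start_scores = [0.1, 2.0, 0.3, 0.0]
end_scores = [3.0, 0.2, 1.5, 0.1]

span = best_span(start_scores, end_scores)
print(span)  # start and end positions of the predicted answer
```

The double loop is quadratic in the passage length, which is fine for a sketch; practical implementations often also cap the maximum span length.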


16.6.5 Summary 


• BERT requires minimal architecture changes (extra fully connected layers) for
  sequence-level and token-level natural language processing applications, such as single text
  classification (e.g., sentiment analysis and testing linguistic acceptability), text pair
  classification or regression (e.g., natural language inference and semantic textual similarity),
  text tagging (e.g., part-of-speech tagging), and question answering.

• During supervised learning of a downstream application, parameters of the extra layers
  are learned from scratch while all the parameters in the pretrained BERT model are
  fine-tuned.


16.6.6 Exercises 


1. Let’s design a search engine algorithm for news articles. When the system receives a
query (e.g., “oil industry during the coronavirus outbreak”), it should return a ranked
list of news articles that are most relevant to the query. Suppose that we have a huge pool 
of news articles and a large number of queries. To simplify the problem, suppose that 
the most relevant article has been labeled for each query. How can we apply negative 
sampling (see Section 15.2.1) and BERT in the algorithm design? 


2. How can we leverage BERT in training language models?


3. Can we leverage BERT in machine translation? 


Discussions


16.7 Natural Language Inference: Fine-Tuning BERT


In earlier sections of this chapter, we have designed an attention-based architecture (in 
Section 16.5) for the natural language inference task on the SNLI dataset (as described 
in Section 16.4). Now we revisit this task by fine-tuning BERT. As discussed in Section 
16.6, natural language inference is a sequence-level text pair classification problem, and 
fine-tuning BERT only requires an additional MLP-based architecture, as illustrated in Fig. 
16.7.1. 


Fig. 16.7.1 This section feeds pretrained BERT to an MLP-based architecture for natural language
inference.


In this section, we will download a pretrained small version of BERT, then fine-tune it for 
natural language inference on the SNLI dataset. 


import json
import multiprocessing
import os
import torch
from torch import nn
from d2l import torch as d2l


16.7.1 Loading Pretrained BERT 


We have explained how to pretrain BERT on the WikiText-2 dataset in Section 15.9 and 
Section 15.10 (note that the original BERT model is pretrained on much bigger corpora). 
As discussed in Section 15.10, the original BERT model has hundreds of millions of param- 
eters. In the following, we provide two versions of pretrained BERT: “bert.base” is about 
as big as the original BERT base model that requires a lot of computational resources to
fine-tune, while “bert.small” is a small version to facilitate demonstration. 


d2l.DATA_HUB['bert.base'] = (d2l.DATA_URL + 'bert.base.torch.zip',
                             '225d66f04cae318b841a13d32af3acc165f253ac')
d2l.DATA_HUB['bert.small'] = (d2l.DATA_URL + 'bert.small.torch.zip',
                              'c72329e68a732bef0452e4b96a1c341c8910f81f')


Either pretrained BERT model contains a “vocab.json” file that defines the vocabulary set 
and a “pretrained.params” file of the pretrained parameters. We implement the following 
load_pretrained_model function to load pretrained BERT parameters. 


def load_pretrained_model(pretrained_model, num_hiddens, ffn_num_hiddens,
                          num_heads, num_blks, dropout, max_len, devices):
    data_dir = d2l.download_extract(pretrained_model)
    # Define an empty vocabulary to load the predefined vocabulary
    vocab = d2l.Vocab()
    vocab.idx_to_token = json.load(open(os.path.join(data_dir, 'vocab.json')))
    vocab.token_to_idx = {token: idx for idx, token in enumerate(
        vocab.idx_to_token)}
    # Pass the architecture arguments through so that the model matches the
    # checkpoint being loaded (e.g., 'bert.small' or 'bert.base')
    bert = d2l.BERTModel(
        len(vocab), num_hiddens, ffn_num_hiddens=ffn_num_hiddens,
        num_heads=num_heads, num_blks=num_blks, dropout=dropout,
        max_len=max_len)
    # Load pretrained BERT parameters
    bert.load_state_dict(torch.load(os.path.join(data_dir,
                                                 'pretrained.params')))
    return bert, vocab


To facilitate demonstration on most machines, we will load and fine-tune the small
version (“bert.small”) of the pretrained BERT in this section. In the exercises, we will show
how to fine-tune the much larger “bert.base” to significantly improve the testing accuracy.


devices = d2l.try_all_gpus()
bert, vocab = load_pretrained_model(
    'bert.small', num_hiddens=256, ffn_num_hiddens=512, num_heads=4,
    num_blks=2, dropout=0.1, max_len=512, devices=devices)


Downloading ../data/bert.small.torch.zip from http://d2l-data.s3-accelerate.amazonaws.com/bert.small.torch.zip...


16.7.2 The Dataset for Fine-Tuning BERT 


For the downstream task of natural language inference on the SNLI dataset, we define a
customized dataset class SNLIBERTDataset. In each example, the premise and hypothesis
form a pair of text sequences and are packed into one BERT input sequence as depicted in
Fig. 16.6.2. Recall from Section 15.8.4 that segment IDs are used to distinguish the premise
and the hypothesis in a BERT input sequence. With the predefined maximum length of a
BERT input sequence (max_len), the last token of the longer of the input text pair keeps
getting removed until max_len is met. To accelerate generation of the SNLI dataset for
fine-tuning BERT, we use 4 worker processes to generate training or testing examples in
parallel.


class SNLIBERTDataset(torch.utils.data.Dataset):
    def __init__(self, dataset, max_len, vocab=None):
        all_premise_hypothesis_tokens = [[
            p_tokens, h_tokens] for p_tokens, h_tokens in zip(
            *[d2l.tokenize([s.lower() for s in sentences])
              for sentences in dataset[:2]])]

        self.labels = torch.tensor(dataset[2])
        self.vocab = vocab
        self.max_len = max_len
        (self.all_token_ids, self.all_segments,
         self.valid_lens) = self._preprocess(all_premise_hypothesis_tokens)
        print('read ' + str(len(self.all_token_ids)) + ' examples')

    def _preprocess(self, all_premise_hypothesis_tokens):
        pool = multiprocessing.Pool(4)  # Use 4 worker processes
        out = pool.map(self._mp_worker, all_premise_hypothesis_tokens)
        all_token_ids = [
            token_ids for token_ids, segments, valid_len in out]
        all_segments = [segments for token_ids, segments, valid_len in out]
        valid_lens = [valid_len for token_ids, segments, valid_len in out]
        return (torch.tensor(all_token_ids, dtype=torch.long),
                torch.tensor(all_segments, dtype=torch.long),
                torch.tensor(valid_lens))

    def _mp_worker(self, premise_hypothesis_tokens):
        p_tokens, h_tokens = premise_hypothesis_tokens
        self._truncate_pair_of_tokens(p_tokens, h_tokens)
        tokens, segments = d2l.get_tokens_and_segments(p_tokens, h_tokens)
        token_ids = self.vocab[tokens] + [self.vocab['<pad>']] \
                             * (self.max_len - len(tokens))
        segments = segments + [0] * (self.max_len - len(segments))
        valid_len = len(tokens)
        return token_ids, segments, valid_len

    def _truncate_pair_of_tokens(self, p_tokens, h_tokens):
        # Reserve slots for '<CLS>', '<SEP>', and '<SEP>' tokens for the BERT
        # input
        while len(p_tokens) + len(h_tokens) > self.max_len - 3:
            if len(p_tokens) > len(h_tokens):
                p_tokens.pop()
            else:
                h_tokens.pop()

    def __getitem__(self, idx):
        return (self.all_token_ids[idx], self.all_segments[idx],
                self.valid_lens[idx]), self.labels[idx]

    def __len__(self):
        return len(self.all_token_ids)
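The truncation rule can be tried in isolation on a toy pair. The sketch below restates the loop from `_truncate_pair_of_tokens` as a standalone function that returns copies instead of modifying its arguments; the name `truncate_pair`, the `num_special` parameter, and the example sentences are assumptions made for illustration.

```python
def truncate_pair(p_tokens, h_tokens, max_len, num_special=3):
    """Drop trailing tokens of the longer text until the pair fits.

    `num_special` slots are reserved for '<cls>' and the two '<sep>' tokens.
    """
    p_tokens, h_tokens = list(p_tokens), list(h_tokens)
    while len(p_tokens) + len(h_tokens) > max_len - num_special:
        if len(p_tokens) > len(h_tokens):
            p_tokens.pop()
        else:
            h_tokens.pop()
    return p_tokens, h_tokens


premise = 'a man in a red shirt runs'.split()  # 7 tokens
hypothesis = 'a man runs'.split()              # 3 tokens

p, h = truncate_pair(premise, hypothesis, max_len=10)
print(p, h)
print(len(p) + len(h) + 3)  # total length including the special tokens
```

With a budget of 10, only the longer premise loses tokens here; the shorter hypothesis is left intact until the two texts reach the same length.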


After downloading the SNLI dataset, we generate training and testing examples by
instantiating the SNLIBERTDataset class. Such examples will be read in minibatches during
training and testing of natural language inference.


# Reduce `batch_size` if there is an out of memory error. In the original BERT
# model, `max_len` = 512
batch_size, max_len, num_workers = 512, 128, d2l.get_dataloader_workers()
data_dir = d2l.download_extract('SNLI')
train_set = SNLIBERTDataset(d2l.read_snli(data_dir, True), max_len, vocab)
test_set = SNLIBERTDataset(d2l.read_snli(data_dir, False), max_len, vocab)
train_iter = torch.utils.data.DataLoader(train_set, batch_size, shuffle=True,
                                         num_workers=num_workers)
test_iter = torch.utils.data.DataLoader(test_set, batch_size,
                                        num_workers=num_workers)


read 549367 examples 
read 9824 examples 


16.7.3 Fine-Tuning BERT 


As Fig. 16.6.2 indicates, fine-tuning BERT for natural language inference requires only an
extra MLP consisting of two fully connected layers (see self.hidden and self.output
in the following BERTClassifier class). This MLP transforms the BERT representation
of the special “<cls>” token, which encodes the information of both the premise and the
hypothesis, into three outputs of natural language inference: entailment, contradiction, and
neutral.


class BERTClassifier(nn.Module):
    def __init__(self, bert):
        super(BERTClassifier, self).__init__()
        self.encoder = bert.encoder
        self.hidden = bert.hidden
        self.output = nn.LazyLinear(3)

    def forward(self, inputs):
        tokens_X, segments_X, valid_lens_x = inputs
        encoded_X = self.encoder(tokens_X, segments_X, valid_lens_x)
        # Position 0 holds the representation of the '<cls>' token
        return self.output(self.hidden(encoded_X[:, 0, :]))


In the following, the pretrained BERT model bert is fed into the BERTClassifier instance
net for the downstream application. In common implementations of BERT fine-tuning,
only the parameters of the output layer of the additional MLP (net.output) will be learned
from scratch. All the parameters of the pretrained BERT encoder (net.encoder) and the
hidden layer of the additional MLP (net.hidden) will be fine-tuned.


net = BERTClassifier(bert)


Recall that in Section 15.8 both the MaskLM class and the NextSentencePred class have
parameters in their employed MLPs. These parameters are part of those in the pretrained
BERT model bert, and thus part of parameters in net. However, such parameters are
only for computing the masked language modeling loss and the next sentence prediction
loss during pretraining. These two loss functions are irrelevant to fine-tuning downstream
applications, so the parameters of the employed MLPs in MaskLM and NextSentencePred
are not updated (they become stale) when BERT is fine-tuned.


To allow parameters with stale gradients, the flag ignore_stale_grad=True is set in the
step function of d2l.train_batch_ch13. We use this function to train and evaluate the
model net using the training set (train_iter) and the testing set (test_iter) of SNLI.
Given the limited computational resources, the training and testing accuracy here can be
further improved; we leave this discussion to the exercises.


lr, num_epochs = 1e-4, 5
trainer = torch.optim.Adam(net.parameters(), lr=lr)
loss = nn.CrossEntropyLoss(reduction='none')
net(next(iter(train_iter))[0])
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)


loss 0.520, train acc 0.791, test acc 0.786
10588.8 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]


[Figure: train loss, train acc, and test acc plotted against epoch (1 to 5).]


16.7.4 Summary 


• We can fine-tune the pretrained BERT model for downstream applications, such as
natural language inference on the SNLI dataset.


• During fine-tuning, the BERT model becomes part of the model for the downstream
application. Parameters that are only related to pretraining loss will not be updated
during fine-tuning.


16.7.5 Exercises 


1. Fine-tune a much larger pretrained BERT model that is about as big as the original BERT
base model if your computational resources allow. Set the arguments in the
load_pretrained_model function as follows: replace 'bert.small' with 'bert.base', and
increase the values of num_hiddens=256, ffn_num_hiddens=512, num_heads=4, and
num_blks=2 to 768, 3072, 12, and 12, respectively. By increasing fine-tuning epochs (and
possibly tuning other hyperparameters), can you get a testing accuracy higher than 0.86?


2. How would you truncate a pair of sequences according to their ratio of length? Compare
this pair truncation method and the one used in the SNLIBERTDataset class. What are their
pros and cons?


Discussions.


17 Reinforcement Learning


Pratik Chaudhari (University of Pennsylvania and Amazon), Rasool Fakoor (Amazon),
and Kavosh Asadi (Amazon)


Reinforcement Learning (RL) is a suite of techniques that allows us to build machine learn- 
ing systems that take decisions sequentially. For example, a package containing new clothes 
that you purchased from an online retailer arrives at your doorstep after a sequence of de- 
cisions, e.g., the retailer finding the clothes in the warehouse closest to your house, putting 
the clothes in a box, transporting the box via land or by air, and delivering it to your house 
within the city. There are many variables that affect the delivery of the package along the 
way, e.g., whether or not the clothes were available in the warehouse, how long it took to 
transport the box, whether it arrived in your city before the daily delivery truck left, etc. 
The key idea is that at each stage these variables that we do not often control affect the 
entire sequence of events in the future, e.g., if there were delays in packing the box in the 
warehouse the retailer may need to send the package via air instead of ground to ensure a 
timely delivery. Reinforcement Learning methods allow us to take the appropriate action 
at each stage of a sequential decision making problem in order to maximize some utility 
eventually, e.g., the timely delivery of the package to you. 


Such sequential decision making problems are seen in numerous other places. While
playing Go, your current move determines the next moves, and the opponent's moves are
the variables that you cannot control; a sequence of moves eventually determines whether
or not you win. Similarly, the movies that Netflix recommends to you now determine what
you watch; whether you like a movie or not is unknown to Netflix, and eventually a sequence
of movie recommendations determines how satisfied you are with Netflix. Reinforcement
learning is being used today to develop effective solutions to these problems (Mnih et al.,
2013, Silver et al., 2016). The key distinction between reinforcement learning and standard
deep learning is that in standard deep learning the prediction of a trained model on one test
datum does not affect the predictions on a future test datum; in reinforcement learning,
decisions at future instants (in RL, decisions are also called actions) are affected by what
decisions were made in the past.


In this chapter, we will develop the fundamentals of reinforcement learning and obtain 
hands-on experience in implementing some popular reinforcement learning methods. We 
will first develop a concept called a Markov Decision Process (MDP) which allows us to 
think of such sequential decision making problems. An algorithm called Value Iteration 
will be our first insight into solving reinforcement learning problems under the assumption
that we know how the uncontrolled variables in an MDP (in RL, these uncontrolled variables
are called the environment) typically behave. Using a more general version of Value
Iteration, an algorithm called Q-Learning, we will be able to take appropriate actions even
when we do not necessarily have full knowledge of the environment. We will then study 
how to use deep networks for reinforcement learning problems by imitating the actions of 
an expert. And finally, we will develop a reinforcement learning method that uses a deep 
network to take actions in unknown environments. These techniques form the basis of more 
advanced RL algorithms that are used today in a variety of real-world applications, some 
of which we will point to in the chapter. 


[Figure: Reinforcement Learning Structure, showing the action, the state, and the reward.]


17.1 Markov Decision Process (MDP) 


In this section, we will discuss how to formulate reinforcement learning problems using 
Markov decision processes (MDPs) and describe various components of MDPs in de- 
tail. 


17.1.1 Definition of an MDP 


A Markov decision process (MDP) (Bellman, 1957) is a model for how the state of a system 
evolves as different actions are applied to the system. A few different quantities come 
together to form an MDP. 


Fig. 17.1.1: A simple gridworld navigation task where the robot not only has to find its way
to the goal location (shown as a green house) but also has to avoid trap locations (shown as
red cross signs).


• Let $\mathcal{S}$ be the set of states in the MDP. As a concrete example, see Fig. 17.1.1
for a robot that is navigating a gridworld. In this case, $\mathcal{S}$ corresponds to the set
of locations that the robot can be at any given timestep.


• Let $\mathcal{A}$ be the set of actions that the robot can take at each state, e.g., “go
forward”, “turn right”, “turn left”, “stay at the same location”, etc. Actions can change the
current state of the robot to some other state within the set $\mathcal{S}$.


• It may happen that we do not know how the robot moves exactly but only know it up to
some approximation. We model this situation in reinforcement learning as follows: if
the robot takes an action “go forward”, there might be a small probability that it stays
at the current state, another small probability that it “turns left”, etc. Mathematically,
this amounts to defining a “transition function”
$T: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ such that
$T(s, a, s') = P(s' \mid s, a)$ using the conditional probability of reaching a state $s'$
given that the robot was at state $s$ and took an action $a$. The transition function is a
probability distribution and we therefore have $\sum_{s' \in \mathcal{S}} T(s, a, s') = 1$
for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$, i.e., the robot has to go to some state
if it takes an action.


• We now construct a notion of which actions are useful and which ones are not using the
concept of a “reward” $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. We say that the
robot gets a reward $r(s, a)$ if the robot takes an action $a$ at state $s$. If the reward
$r(s, a)$ is large, this indicates that taking the action $a$ at state $s$ is more useful to
achieving the goal of the robot, i.e., going to the green house. If the reward $r(s, a)$ is
small, then action $a$ is less useful to achieving this goal. It is important to note that the
reward is designed by the user (the person who creates the reinforcement learning algorithm)
with the goal in mind.
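
The four ingredients listed above can be written down directly in code. The following is only a minimal sketch under stated assumptions, not the gridworld of Fig. 17.1.1: the one-dimensional corridor, the slip probability, and the names `states`, `actions`, `intended`, `T`, and `r` are all hypothetical choices made for illustration.

```python
# Hypothetical 1-D corridor with 4 cells; cell 3 plays the role of the goal.
states = [0, 1, 2, 3]
actions = ["left", "right", "stay"]
slip = 0.1  # assumed probability that an action fails and the robot stays put


def intended(s, a):
    """Cell the robot would reach if the action worked perfectly."""
    step = {"left": -1, "right": 1, "stay": 0}[a]
    return min(max(s + step, 0), len(states) - 1)


def T(s, a, s_next):
    """Transition function T(s, a, s') = P(s' | s, a)."""
    prob = 0.0
    if s_next == intended(s, a):
        prob += 1.0 - slip
    if s_next == s:
        prob += slip
    return prob


def r(s, a):
    """Reward r(s, a): 1 if the action is intended to enter the goal, else 0."""
    return 1.0 if intended(s, a) == 3 and s != 3 else 0.0


# Each T(s, a, .) must be a probability distribution over next states.
for s in states:
    for a in actions:
        total = sum(T(s, a, s_next) for s_next in states)
        assert abs(total - 1.0) < 1e-12
```

The final loop checks the normalization condition stated in the third bullet: for every state and action, the probabilities of all next states sum to one.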


17.1.2 Return and Discount Factor 


The different components above together form a Markov decision process (MDP)


$$\text{MDP}: (\mathcal{S}, \mathcal{A}, T, r). \tag{17.1.1}$$

Let's now consider the situation when the robot starts at a particular state
$s_0 \in \mathcal{S}$ and continues taking actions to result in a trajectory


$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots). \tag{17.1.2}$$


At each time step $t$ the robot is at a state $s_t$ and takes an action $a_t$ which results in a
reward $r_t = r(s_t, a_t)$. The return of a trajectory is the total reward obtained by the robot
along such a trajectory


$$R(\tau) = r_0 + r_1 + r_2 + \cdots. \tag{17.1.3}$$

The goal in reinforcement learning is to find a trajectory that has the largest return.


Think of the situation when the robot continues to travel in the gridworld without ever
reaching the goal location. The sequence of states and actions in a trajectory can be
infinitely long in this case and the return of such an infinitely long trajectory can be
infinite. In order to keep the reinforcement learning formulation meaningful even for such
trajectories, we introduce the notion of a discount factor $\gamma < 1$. We write the
discounted return as


$$R(\tau) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{t=0}^\infty \gamma^t r_t. \tag{17.1.4}$$

Note that if $\gamma$ is very small, the rewards earned by the robot in the far future, say
$t = 1000$, are heavily discounted by the factor $\gamma^{1000}$. This encourages the robot to
select short trajectories that achieve its goal, namely that of going to the green house in the
gridworld example (see Fig. 17.1.1). For large values of the discount factor, say
$\gamma = 0.99$, the robot is encouraged to explore and then find the best trajectory to go to
the goal location.
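
To make the effect of the discount factor concrete, the discounted return in (17.1.4) can be computed for a finite list of rewards. This is a small illustrative sketch; the function name `discounted_return` and the example reward sequence are assumptions introduced here.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for a finite list of rewards."""
    ret = 0.0
    # Accumulate backwards using R_t = r_t + gamma * R_{t+1}
    for reward in reversed(rewards):
        ret = reward + gamma * ret
    return ret


# A reward of 1 is received only at the last step of a 10-step trajectory.
rewards = [0.0] * 9 + [1.0]
print(discounted_return(rewards, gamma=0.5))   # the late reward is heavily discounted
print(discounted_return(rewards, gamma=0.99))  # the late reward keeps most of its value
```

With a small discount factor the reward at the end of the trajectory contributes almost nothing, which is why short trajectories are preferred in that regime.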


17.1.3 Discussion of the Markov Assumption 


Let us think of a new robot where the state $s_t$ is the location as above but the action
$a_t$ is the acceleration that the robot applies to its wheels instead of an abstract command
like “go forward”. If this robot has some non-zero velocity at state $s_t$, then the next
location $s_{t+1}$ is a function of the past location $s_t$, the acceleration $a_t$, and also
the velocity of the robot at time $t$, which is proportional to $s_t - s_{t-1}$. This indicates
that we should have


$$s_{t+1} = \text{some function}(s_t, a_t, s_{t-1}); \tag{17.1.5}$$


the “some function” in our case would be Newton's law of motion. This is quite different
from our transition function that simply depends upon $s_t$ and $a_t$.


Markov systems are all systems where the next state $s_{t+1}$ is only a function of the
current state $s_t$ and the action $a_t$ taken at the current state. In Markov systems, the
next state does not depend on which actions were taken in the past or the states that the
robot was at in the past. For example, the new robot that has acceleration as the action above
is not Markovian because the next location $s_{t+1}$ depends upon the previous state
$s_{t-1}$ through the velocity. It may seem that the Markovian nature of a system is a
restrictive assumption, but it is not so. Markov Decision Processes are still capable of
modeling a very large class of real systems. For example, for our new robot, if we choose our
state $s_t$ to be the tuple (location, velocity), then the system is Markovian because its
next state $(\text{location}_{t+1}, \text{velocity}_{t+1})$ depends only upon the current
state $(\text{location}_t, \text{velocity}_t)$ and the action at the current state $a_t$.
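
The state-augmentation argument can be illustrated with a few lines of code. This is a sketch under assumptions that are not in the text: a one-dimensional robot, a simple Euler discretization with a hypothetical time step `dt`, and a hypothetical function name `step`.

```python
dt = 0.1  # assumed time step of the simulation


def step(state, accel):
    """Next (location, velocity), computed from the current state and action only."""
    location, velocity = state
    new_location = location + velocity * dt
    new_velocity = velocity + accel * dt
    return (new_location, new_velocity)


# Two robots at the same location but with different velocities end up in
# different places, so the location alone is not a Markov state.
print(step((0.0, 1.0), accel=0.0))
print(step((0.0, -1.0), accel=0.0))
```

Because `step` needs nothing beyond the current (location, velocity) pair and the current action, the augmented system satisfies the Markov assumption.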


17.1.4 Summary 


The reinforcement learning problem is typically modeled using Markov Decision Processes.
A Markov decision process (MDP) is defined by a tuple of four entities
$(\mathcal{S}, \mathcal{A}, T, r)$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is
the action space, $T$ is the transition function that encodes the transition probabilities of
the MDP and $r$ is the immediate reward obtained by taking an action at a particular state.


17.1.5 Exercises 


1. Suppose that we want to design an MDP to model the MountainCar problem.


1. What would be the set of states? 


2. What would be the set of actions? 


3. What would be the possible reward functions? 


2. How would you design an MDP for an Atari game like Pong?

Discussions.


17.2 Value Iteration


In this section we will discuss how to pick the best action for the robot at each state to 
maximize the return of the trajectory. We will describe an algorithm called Value Iteration 
and implement it for a simulated robot that travels over a frozen lake. 


17.2.1 Stochastic Policy 


A stochastic policy denoted as $\pi(a \mid s)$ (policy for short) is a conditional
distribution over the actions $a \in \mathcal{A}$ given the state $s \in \mathcal{S}$,
$\pi(a \mid s) \equiv P(a \mid s)$. As an example, suppose the robot has four actions
$\mathcal{A} = \{$go left, go down, go right, go up$\}$. The policy at a state
$s \in \mathcal{S}$ for such a set of actions $\mathcal{A}$ is a categorical distribution
where the probabilities of the four actions could be $[0.4, 0.2, 0.1, 0.3]$; at some other
state $s' \in \mathcal{S}$ the probabilities $\pi(a \mid s')$ of the same four actions could
be $[0.1, 0.1, 0.2, 0.6]$. Note that we should have $\sum_a \pi(a \mid s) = 1$ for any state
$s$. A deterministic policy is a special case of a stochastic policy in that the distribution
$\pi(a \mid s)$ only gives non-zero probability to one particular action, e.g.,
$[1, 0, 0, 0]$ for our example with four actions.
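
A stochastic policy of this kind can be stored as a table with one categorical distribution per state, from which actions are sampled. The snippet below is an illustrative sketch; the state labels, the `policy` table, and the helper `sample_action` are hypothetical names, and only the probabilities are taken from the example above.

```python
import random

actions = ["go left", "go down", "go right", "go up"]
# Hypothetical policy table: one categorical distribution per state.
policy = {
    "s": [0.4, 0.2, 0.1, 0.3],
    "s_prime": [0.1, 0.1, 0.2, 0.6],
    "s_det": [1.0, 0.0, 0.0, 0.0],  # deterministic special case
}


def sample_action(state, rng):
    """Draw an action a ~ pi(. | state)."""
    return rng.choices(actions, weights=policy[state], k=1)[0]


rng = random.Random(0)
for probs in policy.values():
    assert abs(sum(probs) - 1.0) < 1e-12  # each distribution must sum to one
print([sample_action("s", rng) for _ in range(5)])
print(sample_action("s_det", rng))  # a deterministic policy always picks "go left"
```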


To make the notation less cumbersome, we will often write $\pi(s)$ as the conditional
distribution instead of $\pi(a \mid s)$.


17.2.2 Value Function


Imagine now that the robot starts at a state $s_0$ and at each time instant, it first samples
an action from the policy $a_t \sim \pi(s_t)$ and takes this action to result in the next state
$s_{t+1}$. The trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ can be different
depending upon which particular action $a_t$ is sampled at intermediate instants. We define
the average return $R(\tau) = \sum_{t=0}^\infty \gamma^t r(s_t, a_t)$ of all such trajectories


$$V^\pi(s_0) = E_{a_t \sim \pi(s_t)} \Big[ R(\tau) \Big] = E_{a_t \sim \pi(s_t)} \Big[ \sum_{t=0}^\infty \gamma^t r(s_t, a_t) \Big], \tag{17.2.1}$$


where $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$ is the next state of the robot and
$r(s_t, a_t)$ is the instantaneous reward obtained by taking action $a_t$ in state $s_t$ at
time $t$. This is called the “value function” for the policy $\pi$. In simple words, the value
of a state $s_0$ for a policy $\pi$, denoted by $V^\pi(s_0)$, is the expected
$\gamma$-discounted return obtained by the robot if it begins at state $s_0$ and takes
actions from the policy $\pi$ at each time instant.


We next break down the trajectory into two stages (i) the first stage which corresponds to
$s_0 \to s_1$ upon taking the action $a_0$, and (ii) a second stage which is the trajectory
$\tau' = (s_1, a_1, r_1, \ldots)$ thereafter. The key idea behind all algorithms in
reinforcement learning is that the value of state $s_0$ can be written as the average reward
obtained in the first stage and the value function averaged over all possible next states
$s_1$. This is quite intuitive and arises from our Markov assumption: the average return from
the current state is the sum of the average return from the next state and the average reward
of going to the next state. Mathematically, we write the two stages as


$$V^\pi(s_0) = r(s_0, a_0) + \gamma\, E_{a_0 \sim \pi(s_0)} \Big[ E_{s_1 \sim P(s_1 \mid s_0, a_0)} \Big[ V^\pi(s_1) \Big] \Big]. \tag{17.2.2}$$


This decomposition is very powerful: it is the foundation of the principle of dynamic
programming upon which all reinforcement learning algorithms are based. Notice that the
second stage gets two expectations, one over the choices of the action $a_0$ taken in the
first stage using the stochastic policy and another over the possible states $s_1$ obtained
from the chosen action. We can write (17.2.2) using the transition probabilities in the
Markov decision process (MDP) as


$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big[ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) V^\pi(s') \Big]; \ \text{for all } s \in \mathcal{S}. \tag{17.2.3}$$


An important thing to notice here is that the above identity holds for all states
$s \in \mathcal{S}$ because we can think of any trajectory that begins at that state and break
down the trajectory into two stages.


17.2.3 Action-Value Function 


In implementations, it is often useful to maintain a quantity called the “action value”
function which is closely related to the value function. This is defined to be the
average return of a trajectory that begins at $s_0$ but when the action of the first stage is
fixed to be $a_0$

$$Q^\pi(s_0, a_0) = r(s_0, a_0) + E_{a_t \sim \pi(s_t)} \Big[ \sum_{t=1}^\infty \gamma^t r(s_t, a_t) \Big], \tag{17.2.4}$$


note that the summation inside the expectation is from $t = 1, \ldots, \infty$ because the
reward of the first stage is fixed in this case. We can again break down the trajectory into
two parts and write


$$Q^\pi(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, Q^\pi(s', a'); \ \text{for all } s \in \mathcal{S}, a \in \mathcal{A}. \tag{17.2.5}$$


This version is the analog of (17.2.3) for the action value function.


17.2.4 Optimal Stochastic Policy 


Both the value function and the action-value function depend upon the policy that the robot
chooses. We will next think of the “optimal policy” that achieves the maximal average
return


$$\pi^* = \underset{\pi}{\mathrm{argmax}}\; V^\pi(s_0). \tag{17.2.6}$$


Of all possible stochastic policies that the robot could have taken, the optimal policy
$\pi^*$ achieves the largest average discounted return for trajectories starting from state
$s_0$. Let us denote the value function and the action-value function of the optimal policy
as $V^* = V^{\pi^*}$ and $Q^* = Q^{\pi^*}$.


Let us observe that for a deterministic policy there is only one action that is possible
under the policy at any given state. This gives us


$$\pi^*(s) = \underset{a \in \mathcal{A}}{\mathrm{argmax}} \Big[ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V^*(s') \Big]. \tag{17.2.7}$$

A good mnemonic to remember this is that the optimal action at state $s$ (for a deterministic
policy) is the one that maximizes the sum of reward $r(s, a)$ from the first stage and the
average return of the trajectories starting from the next state $s'$, averaged over all
possible next states $s'$ from the second stage.


17.2.5 Principle of Dynamic Programming 


Our development in the previous section in (17.2.2) or (17.2.5) can be turned into an
algorithm to compute the optimal value function $V^*$ or the action-value function $Q^*$,
respectively. Observe that


$$V^*(s) = \sum_{a \in \mathcal{A}} \pi^*(a \mid s) \Big[ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) V^*(s') \Big]; \ \text{for all } s \in \mathcal{S}. \tag{17.2.8}$$


For a deterministic optimal policy $\pi^*$, since there is only one action that can be taken
at state $s$, we can also write


$$V^*(s) = \max_{a \in \mathcal{A}} \Big\{ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) V^*(s') \Big\} \tag{17.2.9}$$

for all states $s \in \mathcal{S}$. This identity is called the “principle of dynamic
programming” (Bellman, 1952, Bellman, 1957). It was formulated by Richard Bellman in the
1950s and we can remember it as “the remainder of an optimal trajectory is also optimal”.


17.2.6 Value Iteration 


We can turn the principle of dynamic programming into an algorithm for finding the optimal
value function called value iteration. The key idea behind value iteration is to think of this
identity as a set of constraints that tie together $V^*(s)$ at different states
$s \in \mathcal{S}$. We initialize the value function to some arbitrary values $V_0(s)$ for
all states $s \in \mathcal{S}$. At the $k^{\text{th}}$ iteration, the Value Iteration
algorithm updates the value function as


$$V_{k+1}(s) = \max_{a \in \mathcal{A}} \Big\{ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) V_k(s') \Big\}; \ \text{for all } s \in \mathcal{S}. \tag{17.2.10}$$

It turns out that as $k \to \infty$ the value function estimated by the Value Iteration
algorithm converges to the optimal value function irrespective of the initialization $V_0$,


$$V^*(s) = \lim_{k \to \infty} V_k(s); \ \text{for all states } s \in \mathcal{S}. \tag{17.2.11}$$


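The same Value Iteration algorithm can be equivalently written using the action-value
function as


$$Q_{k+1}(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \max_{a' \in \mathcal{A}} Q_k(s', a'); \ \text{for all } s \in \mathcal{S}, a \in \mathcal{A}. \tag{17.2.12}$$


In this case we initialize $Q_0(s, a)$ to some arbitrary values for all $s \in \mathcal{S}$
and $a \in \mathcal{A}$. Again we have $Q^*(s, a) = \lim_{k \to \infty} Q_k(s, a)$ for all
$s \in \mathcal{S}$ and $a \in \mathcal{A}$.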
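
Both forms of the update can be tried out on a tiny tabular problem. The following is a minimal sketch, not the FrozenLake implementation: the three-state MDP, the dictionaries `P` and `r`, and the function names are hypothetical. The action-value version takes the maximum over next actions inside the sum over next states.

```python
gamma = 0.9
# Hypothetical 3-state MDP. P[s][a] lists (probability, next_state) pairs.
P = {
    0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
    1: {0: [(1.0, 0)], 1: [(0.8, 2), (0.2, 1)]},
    2: {0: [(1.0, 2)], 1: [(1.0, 2)]},  # absorbing state with zero reward
}
r = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}, 2: {0: 0.0, 1: 0.0}}


def tabular_value_iteration(num_iters):
    V = {s: 0.0 for s in P}  # arbitrary initialization V_0
    for _ in range(num_iters):
        # Build V_{k+1} from V_k for every state (synchronous update)
        V = {
            s: max(
                r[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in P[s]
            )
            for s in P
        }
    return V


def tabular_q_iteration(num_iters):
    Q = {s: {a: 0.0 for a in P[s]} for s in P}  # arbitrary initialization Q_0
    for _ in range(num_iters):
        Q = {
            s: {
                a: r[s][a]
                + gamma * sum(p * max(Q[s2].values()) for p, s2 in P[s][a])
                for a in P[s]
            }
            for s in P
        }
    return Q


V = tabular_value_iteration(200)
Q = tabular_q_iteration(200)
# The greedy policy reads off the maximizing action at each state.
greedy = {s: max(Q[s], key=Q[s].get) for s in P}
print(V, greedy)
```

In this sketch the two iterations agree in the sense that the value of each state equals the largest action value at that state, and the greedy policy is obtained as a by-product.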


17.2.7 Policy Evaluation 


Value Iteration enables us to compute the optimal value function, i.e., $V^{\pi^*}$ of the
optimal deterministic policy $\pi^*$. We can also use similar iterative updates to compute
the value function associated with any other, potentially stochastic, policy $\pi$. We again
initialize $V^\pi_0(s)$ to some arbitrary values for all states $s \in \mathcal{S}$ and at
the $k^{\text{th}}$ iteration, perform the updates


$$V^\pi_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big[ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) V^\pi_k(s') \Big]; \ \text{for all } s \in \mathcal{S}. \tag{17.2.13}$$


This algorithm is known as policy evaluation and is useful to compute the value function
given the policy. Again, it turns out that as $k \to \infty$ these updates converge to the
correct value function irrespective of the initialization $V_0$,


$$V^\pi(s) = \lim_{k \to \infty} V^\pi_k(s); \ \text{for all states } s \in \mathcal{S}. \tag{17.2.14}$$
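
Policy evaluation can be sketched in the same tabular style. The example below makes simplifying assumptions that are not part of the text: a hypothetical two-state MDP with deterministic transitions (so the sum over next states has a single term) and a uniformly random policy; the names `next_state`, `reward`, `pi`, and `policy_evaluation` are illustrative.

```python
gamma = 0.5
# Hypothetical 2-state, 2-action MDP with deterministic transitions.
next_state = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
reward = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
# Uniformly random policy pi(a | s)
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}


def policy_evaluation(num_iters):
    V = {s: 0.0 for s in pi}  # arbitrary initialization
    for _ in range(num_iters):
        # Average the one-step backup over the actions chosen by the policy
        V = {
            s: sum(
                pi[s][a] * (reward[s][a] + gamma * V[next_state[s][a]])
                for a in pi[s]
            )
            for s in pi
        }
    return V


V_pi = policy_evaluation(100)
print(V_pi)
```

For this small example the fixed point can also be found by hand from the two linear equations in the two unknown values, which gives 1.25 for state 0 and 1.75 for state 1.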


The algorithm for computing the action-value function $Q^\pi(s, a)$ of a policy $\pi$ is
analogous.


17.2.8 Implementation of Value Iteration


We next show how to implement Value Iteration for a navigation problem called FrozenLake
from Open AI Gym. We first need to set up the environment as shown in the following
code.


%matplotlib inline
import random
import numpy as np
from d2l import torch as d2l

seed = 0  # Random number generator seed
gamma = 0.95  # Discount factor
num_iters = 10  # Number of iterations
random.seed(seed)  # Set the random seed to ensure results can be reproduced
np.random.seed(seed)

# Now set up the environment
env_info = d2l.make_env('FrozenLake-v1', seed=seed)


In the FrozenLake environment, the robot moves on a 4 × 4 grid (these are the states) with
actions that are “up” (↑), “down” (↓), “left” (←), and “right” (→). The environment
contains a number of holes (H) cells and frozen (F) cells as well as a goal cell (G), all of
which are unknown to the robot. To keep the problem simple, we assume the robot has
reliable actions, i.e., $P(s' \mid s, a) = 1$ for all $s \in \mathcal{S}, a \in \mathcal{A}$.
If the robot reaches the goal, the trial ends and the robot receives a reward of 1 irrespective
of the action; the reward at any other state is 0 for all actions. The objective of the robot is
to learn a policy that reaches the goal location (G) from a given start location (S) (this is
$s_0$) to maximize the return.


The following function implements Value Iteration, where env_info contains MDP and 
environment related information and gamma is the discount factor: 


def value_iteration(env_info, gamma, num_iters):
    env_desc = env_info['desc']  # 2D array shows what each item means
    prob_idx = env_info['trans_prob_idx']
    nextstate_idx = env_info['nextstate_idx']
    reward_idx = env_info['reward_idx']
    num_states = env_info['num_states']
    num_actions = env_info['num_actions']
    mdp = env_info['mdp']

    V = np.zeros((num_iters + 1, num_states))
    Q = np.zeros((num_iters + 1, num_states, num_actions))
    pi = np.zeros((num_iters + 1, num_states))

    for k in range(1, num_iters + 1):
        for s in range(num_states):
            for a in range(num_actions):
                # Calculate \sum_{s'} p(s'\mid s,a) [r + \gamma v_k(s')]
                for pxrds in mdp[(s,a)]:
                    # mdp(s,a): [(p1,next1,r1,d1),(p2,next2,r2,d2),..]
                    pr = pxrds[prob_idx]  # p(s'\mid s,a)
                    nextstate = pxrds[nextstate_idx]  # Next state
                    reward = pxrds[reward_idx]  # Reward
                    Q[k,s,a] += pr * (reward + gamma * V[k - 1, nextstate])
            # Record max value and max action
            V[k,s] = np.max(Q[k,s,:])
            pi[k,s] = np.argmax(Q[k,s,:])
    d2l.show_value_function_progress(env_desc, V[:-1], pi[:-1])

value_iteration(env_info=env_info, gamma=gamma, num_iters=num_iters)


The above pictures show the policy (the arrow indicates the action) and the value function
(the change in color shows how the value function changes over time from the initial value,
shown by dark colors, to the optimal value, shown by light colors). As we see, Value Iteration
finds the optimal value function after 10 iterations and the goal state (G) can be reached
starting from any state as long as it is not an H cell. Another interesting aspect of the
implementation is that in addition to finding the optimal value function, we also
automatically found the optimal policy $\pi^*$ corresponding to this value function.


17.2.9 Summary 


The main idea behind the Value Iteration algorithm is to use the principle of dynamic
programming to find the optimal average return obtained from a given state. Note that
implementing the Value Iteration algorithm requires that we know the Markov decision process
(MDP), e.g., the transition and reward functions, completely.


17.2.10 Exercises 


1. Try increasing the grid size to 8 × 8. Compared with the 4 × 4 grid, how many iterations
does it take to find the optimal value function?


2. What is the computational complexity of the Value Iteration algorithm?


3. Run the Value Iteration algorithm again with $\gamma$ (i.e., gamma in the above code)
set to 0, 0.5, and 1 and analyze the results.


4. How does the value of $\gamma$ affect the number of iterations taken by Value Iteration
to converge? What happens when $\gamma = 1$?


Discussions.


17.3 Q-Learning


In the previous section, we discussed the Value Iteration algorithm which requires accessing 
the complete Markov decision process (MDP), e.g., the transition and reward functions. In 
this section, we will look at Q-Learning (Watkins and Dayan, 1992) which is an algorithm 
to learn the value function without necessarily knowing the MDP. This algorithm embodies 
the central idea behind reinforcement learning: it will enable the robot to obtain its own 
data. 


17.3.1 The Q-Learning Algorithm 


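Recall that value iteration for the action-value function in Section 17.2 (Value Iteration)
corresponds to the update


$$Q_{k+1}(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \max_{a' \in \mathcal{A}} Q_k(s', a'); \ \text{for all } s \in \mathcal{S} \text{ and } a \in \mathcal{A}. \tag{17.3.1}$$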


As we discussed, implementing this algorithm requires knowing the MDP, specifically the
transition function $P(s' \mid s, a)$. The key idea behind Q-Learning is to replace the
summation over all $s' \in \mathcal{S}$ in the above expression by a summation over the
states visited by the robot. This allows us to circumvent the need to know the transition
function.
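
The replacement of an exact expectation by an average over visited states can be illustrated numerically. The snippet below is a sketch with made-up numbers: the transition probabilities `probs` and the values in `max_q_next` (standing in for the maximum action value at each next state) are hypothetical.

```python
import random

# Hypothetical transition distribution P(. | s, a) for one fixed pair (s, a)
next_states = [0, 1, 2]
probs = [0.2, 0.5, 0.3]
max_q_next = {0: 1.0, 1: 2.0, 2: 4.0}  # assumed values of max_a' Q(s', a')

# Exact expectation: requires knowing the transition function
exact = sum(p * max_q_next[s2] for p, s2 in zip(probs, next_states))

# Sample average: requires only the next states observed by the robot
rng = random.Random(0)
visited = rng.choices(next_states, weights=probs, k=10000)
estimate = sum(max_q_next[s2] for s2 in visited) / len(visited)
print(exact, estimate)
```

The sample average uses the transition function only implicitly, through the states that happened to be visited, and approaches the exact expectation as more data is collected.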


17.3.2 An Optimization Problem Underlying Q-Learning 


Let us imagine that the robot uses a policy $\pi_e(a \mid s)$ to take actions. Just like the
previous chapter, it collects a dataset of $n$ trajectories of $T$ timesteps each
$\{ (s_t^i, a_t^i) \}_{t=0,\ldots,T-1;\ i=1,\ldots,n}$. Recall that value iteration is really
a set of constraints that ties together the action-value $Q^*(s, a)$ of different states and
actions to each other. We can implement an approximate version of value iteration using the
data that the robot has collected using $\pi_e$ as


$$\hat{Q} = \min_Q \underbrace{\frac{1}{nT} \sum_{i=1}^n \sum_{t=0}^{T-1} \Big( Q(s_t^i, a_t^i) - r(s_t^i, a_t^i) - \gamma \max_{a'} Q(s_{t+1}^i, a') \Big)^2}_{:= \ell(Q)}. \tag{17.3.2}$$


Let us first observe the similarities and differences between this expression and value
iteration above. If the robot's policy $\pi_e$ were equal to the optimal policy $\pi^*$, and
if it collected an infinite amount of data, then this optimization problem would be identical
to the optimization problem underlying value iteration. But while value iteration requires us
to know $P(s' \mid s, a)$, the optimization objective does not have this term. We have not
cheated: as the robot uses the policy $\pi_e$ to take an action $a_t^i$ at state $s_t^i$, the
next state $s_{t+1}^i$ is a sample drawn from the transition function. So the optimization
objective also has access to the transition function, but implicitly in terms of the data
collected by the robot.


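The variables of our optimization problem are $Q(s, a)$ for all $s \in \mathcal{S}$ and
$a \in \mathcal{A}$. We can minimize the objective using gradient descent. For every pair
$(s_t^i, a_t^i)$ in our dataset, we can write


$$\begin{aligned} Q(s_t^i, a_t^i) &\leftarrow Q(s_t^i, a_t^i) - \alpha \nabla_{Q(s_t^i, a_t^i)} \ell(Q) \\ &= (1 - \alpha)\, Q(s_t^i, a_t^i) + \alpha \Big( r(s_t^i, a_t^i) + \gamma \max_{a'} Q(s_{t+1}^i, a') \Big), \end{aligned} \tag{17.3.3}$$

where $\alpha$ is the learning rate (constant factors of the gradient are absorbed into
$\alpha$, and the target $r + \gamma \max_{a'} Q$ is treated as fixed when differentiating).
Typically in real problems, when the robot reaches the goal location, the trajectories end.
The value of such a terminal state is zero because the robot does not take any further actions
beyond this state. We should modify our update to handle such states as


$$Q(s_t^i, a_t^i) = (1 - \alpha)\, Q(s_t^i, a_t^i) + \alpha \Big( r(s_t^i, a_t^i) + \gamma \big(1 - \mathbb{1}_{s_{t+1}^i \text{ is terminal}}\big) \max_{a'} Q(s_{t+1}^i, a') \Big), \tag{17.3.4}$$

where $\mathbb{1}_{s_{t+1}^i \text{ is terminal}}$ is an indicator variable that is one if
$s_{t+1}^i$ is a terminal state and zero otherwise. The value of state-action tuples $(s, a)$
that are not a part of the dataset is set to $-\infty$. This algorithm is known as Q-Learning.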
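
A single tabular update of this kind fits in a few lines. The sketch below moves the current estimate toward the target formed from the reward and the discounted best next value, with the bootstrap term dropped at terminal states. The function name `q_learning_update`, the two hand-written transitions, and the choice to start unseen pairs at zero (rather than at negative infinity) are assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, reward, s_next, terminal, actions, alpha, gamma):
    """Apply one Q-Learning update for the transition (s, a, reward, s_next)."""
    # The value of a terminal next state is zero
    bootstrap = 0.0 if terminal else max(Q[(s_next, a2)] for a2 in actions)
    target = reward + gamma * bootstrap
    # Convex combination of the old estimate and the target
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target


actions = [0, 1]
Q = defaultdict(float)  # unseen pairs start at 0 here, a simplification
# Hypothetical transitions (s, a, r, s', terminal) collected by the robot
data = [(0, 1, 0.0, 1, False), (1, 1, 1.0, 2, True)]
for _ in range(50):
    for s, a, rew, s_next, terminal in data:
        q_learning_update(Q, s, a, rew, s_next, terminal, actions,
                          alpha=0.5, gamma=0.9)
print(Q[(1, 1)], Q[(0, 1)])
```

Repeating the update over the same two transitions drives the estimate for the rewarded pair toward 1 and the estimate for the preceding pair toward the discounted value 0.9.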


Given the solution of these updates $\hat{Q}$, which is an approximation of the optimal
value function $Q^*$, we can obtain the optimal deterministic policy corresponding to this
value function easily using


$$\hat{\pi}(s) = \mathrm{argmax}_{a}\, \hat{Q}(s, a). \tag{17.3.5}$$


There can be situations when there are multiple deterministic policies that correspond to
the same optimal value function; such ties can be broken arbitrarily because they have the
same value function.


17.3.3 Exploration in Q-Learning

The policy $\pi_e$ used by the robot to collect data is critical to ensure that Q-Learning
works well. After all, we have replaced the expectation over $s'$ under the transition
function $P(s' \mid s, a)$ with an average over the data collected by the robot. If the policy
$\pi_e$ does not reach diverse parts of the state-action space, then it is easy to imagine
that our estimate $\hat{Q}$ will be a poor approximation of the optimal $Q^*$. It is also
important to note that in such a situation, the estimate of $Q^*$ at all states
$s \in \mathcal{S}$ will be bad, not just the ones visited by $\pi_e$. This is because the
Q-Learning objective (or value iteration) is a constraint that ties together the value of all
state-action pairs. It is therefore critical to pick the correct policy $\pi_e$ to collect
data.


We can mitigate this concern by picking a completely random policy $\pi_e$ that samples
actions uniformly at random from $\mathcal{A}$. Such a policy would visit all states, but it
will take a large number of trajectories before it does so.


We thus arrive at the second key idea in Q-Learning, namely exploration. Typical
implementations of Q-Learning tie together the current estimate of $Q$ and the policy
$\pi_e$ to set


$$\pi_e(a \mid s) = \begin{cases} \mathrm{argmax}_{a'}\, \hat{Q}(s, a') & \text{with prob. } 1 - \epsilon \\ \text{uniform}(\mathcal{A}) & \text{with prob. } \epsilon, \end{cases} \tag{17.3.6}$$


where $\epsilon$ is called the “exploration parameter” and is chosen by the user. The policy
$\pi_e$ is called an exploration policy. This particular $\pi_e$ is called an
$\epsilon$-greedy exploration policy because it chooses the optimal action (under the current
estimate $\hat{Q}$) with probability $1 - \epsilon$ but explores randomly with the remainder
probability $\epsilon$. We can also use the so-called softmax exploration policy


$$\pi_e(a \mid s) = \frac{e^{\hat{Q}(s, a)/T}}{\sum_{a'} e^{\hat{Q}(s, a')/T}}; \tag{17.3.7}$$

where the hyper-parameter $T$ is called temperature. A large value of $\epsilon$ in the
$\epsilon$-greedy policy functions similarly to a large value of temperature $T$ for the
softmax policy.
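
Both exploration policies are short to write down for a single state. The following sketch uses hypothetical helper names (`epsilon_greedy`, `softmax_probs`) and made-up action-value estimates; subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the resulting probabilities.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick the argmax with prob. 1 - epsilon, else a uniformly random action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_probs(q_values, temperature):
    """Action probabilities proportional to exp(Q(s, a) / T)."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
q_values = [0.1, 0.5, 0.2, 0.0]  # hypothetical estimates of Q(s, .) at one state
print(epsilon_greedy(q_values, epsilon=0.0, rng=rng))  # purely greedy choice
print(softmax_probs(q_values, temperature=0.1))    # concentrates on the best action
print(softmax_probs(q_values, temperature=100.0))  # close to uniform
```

A low temperature behaves like a small exploration parameter (nearly greedy), while a high temperature behaves like a large one (nearly uniform).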


It is important to note that when we pick an exploration policy that depends upon the current
estimate of the action-value function $\hat{Q}$, we need to re-solve the optimization
problem periodically. Typical implementations of Q-Learning make one mini-batch update
using a few state-action pairs in the collected dataset (typically the ones collected from the
previous timestep of the robot) after taking every action using $\pi_e$.


17.3.4 The “‘Self-correcting” Property of Q-Learning 


The dataset collected by the robot during Q-Learning grows with time. Both the exploration
policy $\pi_e$ and the estimate $\hat{Q}$ evolve as the robot collects more data. This gives us a key
insight into why Q-Learning works well. Consider a state $s$: if a particular action $a$ has
a large value under the current estimate $\hat{Q}(s, a)$, then both the $\epsilon$-greedy and the softmax
exploration policies have a larger probability of picking this action. If this action actually is
not the ideal action, then the future states that arise from this action will have poor rewards.
The next update of the Q-Learning objective will therefore reduce the value $\hat{Q}(s, a)$, which
will reduce the probability of picking this action the next time the robot visits state $s$. Bad
actions, e.g., ones whose value is overestimated in $\hat{Q}(s, a)$, are explored by the robot but
their value is corrected in the next update of the Q-Learning objective. Good actions, e.g.,
those whose value $\hat{Q}(s, a)$ is large, are explored more often by the robot and thereby reinforced.
This property can be used to show that Q-Learning can converge to the optimal policy even
if it begins with a random policy $\pi_e$ (Watkins and Dayan, 1992).


This ability to not only collect new data but also collect the right kind of data is the central
feature of reinforcement learning algorithms, and this is what distinguishes them from
supervised learning. Q-Learning, using deep neural networks (which we will see in the
DQN chapter later), is responsible for the resurgence of reinforcement learning (Mnih et
al., 2013).


17.3.5 Implementation of Q-Learning 


We now show how to implement Q-Learning on FrozenLake from OpenAI Gym. Note that
this is the same setup as we considered in the Value Iteration experiment (page 785).


%matplotlib inline
import random
import numpy as np
from d2l import torch as d2l

seed = 0  # Random number generator seed
gamma = 0.95  # Discount factor
num_iters = 256  # Number of iterations
alpha = 0.9  # Learning rate
epsilon = 0.9  # Epsilon in the epsilon-greedy algorithm
random.seed(seed)  # Set the random seed
np.random.seed(seed)

# Now set up the environment
env_info = d2l.make_env('FrozenLake-v1', seed=seed)


In the FrozenLake environment, the robot moves on a $4 \times 4$ grid (these are the states) with
actions that are “up” (↑), “down” (↓), “left” (←), and “right” (→). The environment
contains a number of hole (H) cells and frozen (F) cells as well as a goal cell (G), all of
which are unknown to the robot. To keep the problem simple, we assume the robot has
reliable actions, i.e. $P(s' \mid s, a) = 1$ for all $s \in \mathcal{S}, a \in \mathcal{A}$. If the robot reaches the goal, the
trial ends and the robot receives a reward of 1 irrespective of the action; the reward at any
other state is 0 for all actions. The objective of the robot is to learn a policy that reaches the
goal location (G) from a given start location (S) (this is $s_0$) to maximize the return.


We first implement the $\epsilon$-greedy method as follows:


def e_greedy(env, Q, s, epsilon):
    # Explore with probability epsilon, otherwise act greedily w.r.t. Q
    if random.random() < epsilon:
        return env.action_space.sample()
    else:
        return np.argmax(Q[s, :])


We are now ready to implement Q-learning:


def q_learning(env_info, gamma, num_iters, alpha, epsilon):
    env_desc = env_info['desc']  # 2D array specifying what each grid item means
    env = env_info['env']  # The environment object
    num_states = env_info['num_states']
    num_actions = env_info['num_actions']

    Q = np.zeros((num_states, num_actions))
    V = np.zeros((num_iters + 1, num_states))
    pi = np.zeros((num_iters + 1, num_states))

    for k in range(1, num_iters + 1):
        # Reset environment
        state, done = env.reset(), False
        while not done:
            # Select an action for the given state and act in the
            # environment based on the selected action
            action = e_greedy(env, Q, state, epsilon)
            next_state, reward, done, _ = env.step(action)

            # Q-update:
            y = reward + gamma * np.max(Q[next_state, :])
            Q[state, action] = Q[state, action] + alpha * (y - Q[state, action])

            # Move to the next state
            state = next_state
        # Record max value and max action for visualization purpose only
        for s in range(num_states):
            V[k, s] = np.max(Q[s, :])
            pi[k, s] = np.argmax(Q[s, :])
    d2l.show_Q_function_progress(env_desc, V[:-1], pi[:-1])


q_learning(env_info=env_info, gamma=gamma, num_iters=num_iters, alpha=alpha,
           epsilon=epsilon)


This result shows that Q-learning can find the optimal solution for this problem after roughly
250 iterations. However, when we compare this result with the Value Iteration algorithm’s
result (see Implementation of Value Iteration (page 789)), we can see that the Value Iteration
algorithm needs far fewer iterations to find the optimal solution for this problem. This
happens because the Value Iteration algorithm has access to the full MDP whereas
Q-learning does not.
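To see the update rule in isolation, here is a minimal, self-contained sketch of tabular Q-learning that does not depend on Gym or `d2l`. The environment is a hypothetical five-state chain (not FrozenLake) in which action 0 moves left, action 1 moves right, and reaching the last state yields a reward of 1 and ends the episode; the names `chain_step` and `q_learning_chain` and all parameter values are illustrative assumptions.

```python
import numpy as np

def chain_step(state, action, n_states):
    """Deterministic 1-D chain: action 0 moves left, action 1 moves right."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == n_states - 1      # the last state is the goal
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def q_learning_chain(n_states=5, gamma=0.95, alpha=0.9, epsilon=0.5,
                     num_episodes=200, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))
    for _ in range(num_episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy action selection
            if rng.random() < epsilon:
                action = int(rng.integers(2))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = chain_step(state, action, n_states)
            # Same Q-update as in q_learning above
            y = reward + gamma * np.max(Q[next_state])
            Q[state, action] += alpha * (y - Q[state, action])
            state = next_state
    return Q

Q = q_learning_chain()
greedy_policy = np.argmax(Q[:-1], axis=1)  # greedy action in each non-goal state
```

In this toy problem the learned greedy policy should move right in every non-goal state, and the value of moving right from the state next to the goal should approach the reward of 1.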


17.3.6 Summary 


Q-learning is one of the most fundamental reinforcement-learning algorithms. It has been
at the epicenter of the recent success of reinforcement learning, most notably in learning
to play video games (Mnih et al., 2013). Implementing Q-learning does not require that
we know the Markov decision process (MDP), e.g., the transition and reward functions,
completely.


17.3.7 Exercises


1. Try increasing the grid size to $8 \times 8$. Compared with the $4 \times 4$ grid, how many iterations
does it take to find the optimal value function?


2. Run the Q-learning algorithm again with $\gamma$ (i.e. “gamma” in the above code) set
to 0, 0.5, and 1, and analyze the results.


3. Run the Q-learning algorithm again with $\epsilon$ (i.e. “epsilon” in the above code) set
to 0, 0.5, and 1, and analyze the results.


Discussions.


Andrew Gordon Wilson (New York University and Amazon)


Gaussian processes (GPs) are ubiquitous. You have already encountered many examples of
GPs without realizing it. Any model that is linear in its parameters with a Gaussian
distribution over the parameters is a Gaussian process. This class spans discrete models,
including random walks and autoregressive processes, as well as continuous models, including
Bayesian linear regression models, polynomials, Fourier series, radial basis functions, and
even neural networks with an infinite number of hidden units. There is a running joke that
“everything is a special case of a Gaussian process”.


Learning about Gaussian processes is important for three reasons: (1) they provide a
function space perspective of modelling, which makes understanding a variety of model classes,
including deep neural networks, much more approachable; (2) they have an extraordinary
range of applications where they are state-of-the-art, including active learning,
hyperparameter learning, auto-ML, and spatiotemporal regression; (3) over the last few years,
algorithmic advances have made Gaussian processes increasingly scalable and relevant,
harmonizing with deep learning through frameworks such as GPyTorch (Gardner et
al., 2018). Indeed, GPs and deep neural networks are not competing approaches, but
highly complementary, and can be combined to great effect. These algorithmic advances
are not just relevant to Gaussian processes, but provide a foundation in numerical methods
that is broadly useful in deep learning.


In this chapter, we introduce Gaussian processes. In the introductory notebook, we start
by reasoning intuitively about what Gaussian processes are and how they directly model
functions. In the priors notebook, we focus on how to specify Gaussian process priors.
We directly connect the traditional weight-space approach to modelling to function space,
which will help us reason about constructing and understanding machine learning models,
including deep neural networks. We then introduce popular covariance functions, also
known as kernels, which control the generalization properties of a Gaussian process. A
GP with a given kernel defines a prior over functions. In the inference notebook, we will
show how to use data to infer a posterior, in order to make predictions. This notebook
contains from-scratch code for making predictions with a Gaussian process, as well as an
introduction to GPyTorch. In upcoming notebooks, we will introduce the numerics behind
Gaussian processes, which is useful for scaling Gaussian processes but also a powerful
general foundation for deep learning, and advanced use-cases such as hyperparameter tuning in
deep learning. Our examples will make use of GPyTorch, which makes Gaussian processes
scale, and is closely integrated with deep learning functionality and PyTorch.


18.1 Introduction to Gaussian Processes


In many cases, machine learning amounts to estimating parameters from data. These pa- 
rameters are often numerous and relatively uninterpretable — such as the weights of a neu- 
ral network. Gaussian processes, by contrast, provide a mechanism for directly reasoning 
about the high-level properties of functions that could fit our data. For example, we may 
have a sense of whether these functions are quickly varying, periodic, involve conditional 
independencies, or translation invariance. Gaussian processes enable us to easily incorpo- 
rate these properties into our model, by directly specifying a Gaussian distribution over the 
function values that could fit our data. 


Let’s get a feel for how Gaussian processes operate, by starting with some examples. 


Suppose we observe the following dataset, of regression targets (outputs), $y$, indexed by
inputs, $x$. As an example, the targets could be changes in carbon dioxide concentrations,
and the inputs could be the times at which these targets have been recorded. What are
some features of the data? How quickly does it seem to vary? Do we have data points
collected at regular intervals, or are there missing inputs? How would you imagine filling
in the missing regions, or forecasting up until $x = 25$?


Observed data.


In order to fit the data with a Gaussian process, we start by specifying a prior distribution
over what types of functions we might believe to be reasonable. Here we show several
sample functions from a Gaussian process. Does this prior look reasonable? Note that here
we are not looking for functions that fit our dataset, but instead specifying reasonable
high-level properties of the solutions, such as how quickly they vary with inputs. Note that
we will see code for reproducing all of the plots in this notebook in the next notebooks on
priors and inference.


Sample prior functions that we may want to represent with our model.


Once we condition on data, we can use this prior to infer a posterior distribution over
functions that could fit the data. Here we show sample posterior functions.


Sample posterior functions, once we have observed the data.


We see that each of these functions is entirely consistent with our data, perfectly running
through each observation. In order to use these posterior samples to make predictions, we
can average the values of every possible sample function from the posterior, to create the
curve below, in thick blue. Note that we do not actually have to take an infinite number of
samples to compute this expectation; as we will see later, we can compute the expectation
in closed form.


Posterior samples, alongside posterior mean, which can be used for point predictions, in
blue.


We may also want a representation of uncertainty, so we know how confident we should
be in our predictions. Intuitively, we should have more uncertainty where there is more
variability in the sample posterior functions, as this tells us there are many more possible
values the true function could take. This type of uncertainty is called epistemic uncertainty,
which is the reducible uncertainty associated with lack of information. As we acquire more
data, this type of uncertainty disappears, as there will be increasingly fewer solutions
consistent with what we observe. As with the posterior mean, we can compute the posterior
variance (the variability of these functions in the posterior) in closed form. With shading,
we show two times the posterior standard deviation on either side of the mean, creating a
credible interval that has a 95% probability of containing the true value of the function for
any input $x$.


Posterior samples, including 95% credible set. 


The plot looks somewhat cleaner if we remove the posterior samples, simply visualizing 
the data, posterior mean, and 95% credible set. Notice how the uncertainty grows away 
from the data, a property of epistemic uncertainty. 


Point predictions, and credible set. 


The properties of the Gaussian process that we used to fit the data are strongly controlled
by what’s called a covariance function, also known as a kernel. The covariance function
we used is called the RBF (Radial Basis Function) kernel, which has the form

$$k_{\text{RBF}}(x, x') = \mathrm{Cov}(f(x), f(x')) = a^2 \exp\left(-\frac{1}{2\ell^2} \|x - x'\|^2\right) \qquad (18.1.1)$$

The hyperparameters of this kernel are interpretable. The amplitude parameter $a$ controls
the vertical scale over which the function is varying, and the length-scale parameter $\ell$
controls the rate of variation (the wiggliness) of the function. Larger $a$ means larger function
values, and larger $\ell$ means more slowly varying functions. Let’s see what happens to our
sample prior and posterior functions as we vary $a$ and $\ell$.


The length-scale has a particularly pronounced effect on the predictions and uncertainty of
a GP. At $\|x - x'\| = \ell$, the covariance between a pair of function values is $a^2 \exp(-0.5)$.
At distances larger than $\ell$, the function values become nearly uncorrelated.
This means that if we want to make a prediction at a point $x_*$, then function values with
inputs $x$ such that $\|x - x_*\| > \ell$ will not have a strong effect on our predictions.
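The kernel in (18.1.1) is short enough to write out directly. The following is a minimal NumPy sketch for one-dimensional inputs; the function name `rbf_kernel` and its default arguments are illustrative choices rather than the book's implementation.

```python
import numpy as np

def rbf_kernel(x1, x2, amplitude=1.0, lengthscale=2.0):
    """RBF kernel (18.1.1) evaluated between two sets of 1-D inputs."""
    # Pairwise squared distances, shape (len(x1), len(x2))
    sq_dist = (np.asarray(x1)[:, None] - np.asarray(x2)[None, :]) ** 2
    return amplitude ** 2 * np.exp(-0.5 * sq_dist / lengthscale ** 2)

x = np.array([5.0, 7.0, 10.0])
K_default = rbf_kernel(x, x)                  # length-scale 2
K_short = rbf_kernel(x, x, lengthscale=0.1)   # inputs almost uncorrelated
K_long = rbf_kernel(x, x, lengthscale=10.0)   # inputs highly correlated
```

With the default length-scale of 2, the entry for the inputs 5 and 7 (exactly one length-scale apart) equals $a^2 \exp(-0.5)$, in line with the discussion above.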


Let’s see how changing the length-scale affects sample prior and posterior functions, and
credible sets. The above fits use a length-scale of 2. Let’s now consider $\ell = 0.1, 0.5, 2, 5, 10$.
A length-scale of 0.1 is very small relative to the range of the input domain we are
considering, 25. For example, the values of the function at $x = 5$ and $x = 10$ will have essentially
no correlation at such a length-scale. On the other hand, for a length-scale of 10, the
function values at these inputs will be highly correlated. Note that the vertical scale changes in
the following figures.


[Figures: sample prior and posterior functions for length-scales $\ell$ = 0.1, 0.5, 2, and 5, each with amplitude $a = 1$.]


Notice that as the length-scale increases, the “wiggliness” of the functions decreases, and our
uncertainty decreases. If the length-scale is small, the uncertainty will quickly increase as
we move away from the data, as the datapoints become less informative about the function
values.


Now, let’s vary the amplitude parameter, holding the length-scale fixed at 2. Note the ver- 
tical scale is held fixed for the prior samples, and varies for the posterior samples, so you 
can clearly see both the increasing scale of the function, and the fits to the data. 


[Figures: sample prior and posterior functions for length-scale $\ell = 2$, with amplitudes $a$ = 0.1 and 2.]


We see that the amplitude parameter affects the scale of the function, but not the rate of variation.
At this point, we also have the sense that the generalization performance of our procedure
will depend on having reasonable values for these hyperparameters. Values of $\ell = 2$ and $a = 1$
appeared to provide reasonable fits, while some of the other values did not. Fortunately,
there is a robust and automatic way to specify these hyperparameters, using what is called
the marginal likelihood, which we will return to in the notebook on inference.


So what is a GP, really? As we stated at the outset, a GP simply says that any collection of function
values $f(x_1), \dots, f(x_n)$, indexed by any collection of inputs $x_1, \dots, x_n$, has a joint
multivariate Gaussian distribution. The mean vector $\mu$ of this distribution is given by a mean
function, which is typically taken to be a constant or zero. The covariance matrix of this
distribution is given by the kernel evaluated at all pairs of the inputs $x$.

$$\begin{bmatrix} f(x) \\ f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix} \sim \mathcal{N}\left(\mu, \begin{bmatrix} k(x, x) & k(x, x_1) & \dots & k(x, x_n) \\ k(x_1, x) & k(x_1, x_1) & \dots & k(x_1, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x) & k(x_n, x_1) & \dots & k(x_n, x_n) \end{bmatrix}\right) \qquad (18.1.2)$$

Equation (18.1.2) specifies a GP prior. We can compute the conditional distribution of $f(x)$
for any $x$ given $f(x_1), \dots, f(x_n)$, the function values we have observed. This conditional
distribution is called the posterior, and it is what we use to make predictions.
In particular,

$$f(x) \mid f(x_1), \dots, f(x_n) \sim \mathcal{N}(m, s^2) \qquad (18.1.3)$$

where

$$m = k(x, x_{1:n}) k(x_{1:n}, x_{1:n})^{-1} f(x_{1:n}) \qquad (18.1.4)$$

$$s^2 = k(x, x) - k(x, x_{1:n}) k(x_{1:n}, x_{1:n})^{-1} k(x_{1:n}, x) \qquad (18.1.5)$$

where $k(x, x_{1:n})$ is a $1 \times n$ vector formed by evaluating $k(x, x_i)$ for $i = 1, \dots, n$,
$k(x_{1:n}, x)$ is its transpose, and $k(x_{1:n}, x_{1:n})$ is an $n \times n$ matrix formed by
evaluating $k(x_i, x_j)$ for $i, j = 1, \dots, n$. $m$ is
what we can use as a point predictor for any $x$, and $s^2$ is what we use for uncertainty: if we
want to create an interval with a 95% probability that $f(x)$ is in the interval, we would use
$m \pm 2s$. The predictive means and uncertainties for all the above figures were created using
these equations. The observed data points were given by $f(x_1), \dots, f(x_n)$, and we chose a
fine-grained set of $x$ points at which to make predictions.
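The predictive equations (18.1.4) and (18.1.5) translate almost line by line into NumPy. The sketch below assumes a zero prior mean, noise-free observations, and one-dimensional inputs; the names `rbf_kernel` and `gp_posterior`, the small `jitter` term added for numerical stability, and the training data are all illustrative assumptions rather than the book's code.

```python
import numpy as np

def rbf_kernel(x1, x2, amplitude=1.0, lengthscale=2.0):
    sq_dist = (np.asarray(x1)[:, None] - np.asarray(x2)[None, :]) ** 2
    return amplitude ** 2 * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def gp_posterior(x_test, x_train, f_train, kernel, jitter=1e-10):
    """Posterior mean (18.1.4) and variance (18.1.5) at each test input."""
    K = kernel(x_train, x_train) + jitter * np.eye(len(x_train))
    k_star = kernel(x_test, x_train)           # shape (n_test, n)
    # Solve linear systems rather than forming the inverse explicitly
    m = k_star @ np.linalg.solve(K, f_train)
    v = np.linalg.solve(K, k_star.T)           # shape (n, n_test)
    s2 = np.diag(kernel(x_test, x_test)) - np.sum(k_star * v.T, axis=1)
    return m, np.maximum(s2, 0.0)              # clip tiny negative round-off

x_train = np.array([1.0, 4.0, 6.0])
f_train = np.sin(x_train)                      # hypothetical observations
x_test = np.array([1.0, 5.0, 25.0])
m, s2 = gp_posterior(x_test, x_train, f_train, rbf_kernel)
```

At a training input the posterior mean reproduces the observation and the variance is essentially zero, while far from the data (here at $x = 25$) the prediction reverts to the prior mean of zero with prior variance $a^2 = 1$.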


Let’s suppose we observe a single datapoint, $f(x_1)$, and we want to determine the value
of $f(x)$ at some $x$. Because $f(x)$ is described by a Gaussian process, we know the joint
distribution over $(f(x), f(x_1))$ is Gaussian:

$$\begin{bmatrix} f(x) \\ f(x_1) \end{bmatrix} \sim \mathcal{N}\left(\mu, \begin{bmatrix} k(x, x) & k(x, x_1) \\ k(x_1, x) & k(x_1, x_1) \end{bmatrix}\right) \qquad (18.1.6)$$

The off-diagonal expression $k(x, x_1) = k(x_1, x)$ tells us how correlated the function values
will be, that is, how strongly determined $f(x)$ will be from $f(x_1)$. We have seen already that if
we use a large length-scale, relative to the distance between $x$ and $x_1$, $\|x - x_1\|$, then the
function values will be highly correlated. We can visualize the process of determining $f(x)$
from $f(x_1)$ both in the space of functions, and in the joint distribution over $f(x_1), f(x)$.
Let’s initially consider an $x$ such that $k(x, x_1) = 0.9$, and $k(x, x) = 1$, meaning that the
value of $f(x)$ is moderately correlated with the value of $f(x_1)$. In the joint distribution, the
contours of constant probability will be relatively narrow ellipses.

Suppose we observe $f(x_1) = 1.2$. To condition on this value of $f(x_1)$, we can draw a
horizontal line at 1.2 on our plot of the density, and see that the value of $f(x)$ is mostly
constrained to $[0.64, 1.52]$. We have also drawn this plot in function space, showing the
observed point $f(x_1)$ in orange, and 1 standard deviation of the Gaussian process predictive
distribution for $f(x)$ in blue, about the mean value of 1.08.


Now suppose we have a stronger correlation, $k(x, x_1) = 0.95$. Now the ellipses have
narrowed further, and the value of $f(x)$ is even more strongly determined by $f(x_1)$.
Drawing a horizontal line at 1.2, we see the contours for $f(x)$ support values mostly within
$[0.83, 1.45]$. Again, we also show the plot in function space, with one standard deviation
about the mean predictive value of 1.14.
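The numbers quoted in these two examples can be checked by hand from (18.1.4) and (18.1.5) with $n = 1$. The short calculation below assumes a zero prior mean and $k(x_1, x_1) = 1$ (the text only states $k(x, x) = 1$), and takes the quoted intervals to be one standard deviation either side of the mean.

```latex
% Single observation f(x_1) = 1.2, zero prior mean, k(x,x) = k(x_1,x_1) = 1
m = \frac{k(x, x_1)}{k(x_1, x_1)} f(x_1), \qquad
s^2 = k(x, x) - \frac{k(x, x_1)^2}{k(x_1, x_1)}

% k(x, x_1) = 0.9
m = 0.9 \times 1.2 = 1.08, \quad s^2 = 1 - 0.81 = 0.19, \quad s \approx 0.44,
\quad m \pm s \approx [0.64, 1.52]

% k(x, x_1) = 0.95
m = 0.95 \times 1.2 = 1.14, \quad s^2 = 1 - 0.9025 = 0.0975, \quad s \approx 0.31,
\quad m \pm s \approx [0.83, 1.45]
```

Under these assumptions the results agree with the means and intervals given in the text: a stronger correlation pulls the mean closer to the observed value 1.2 and shrinks the standard deviation.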


[Figures for correlation = 0.95: joint density over $f(x_1)$ and $f(x)$, and the function-space view plotting observations $y$ against inputs $x$.]


We see that the posterior mean predictor of our Gaussian process is closer to 1.2,
because there is now a stronger correlation. We also see that our uncertainty (the error bars)
has somewhat decreased. Despite the strong correlation between these function values,
our uncertainty is still rightly quite large, because we have only observed a single data
point!


This procedure can give us a posterior on $f(x)$ for any $x$, for any number of points we have
observed. Suppose we observe $f(x_1), f(x_2)$. We now visualize the posterior for $f(x)$ at a
particular $x = x'$ in function space. The exact distribution for $f(x)$ is given by the above
equations. $f(x)$ is Gaussian distributed, with mean

$$m = k(x, x_{1:3}) k(x_{1:3}, x_{1:3})^{-1} f(x_{1:3}) \qquad (18.1.7)$$

and variance

$$s^2 = k(x, x) - k(x, x_{1:3}) k(x_{1:3}, x_{1:3})^{-1} k(x_{1:3}, x) \qquad (18.1.8)$$


In this introductory notebook, we have been considering noise-free observations. As we will
see, it is easy to include observation noise. If we assume that the data are generated from a
latent noise-free function $f(x)$ plus iid Gaussian noise $\epsilon(x) \sim \mathcal{N}(0, \sigma^2)$ with variance $\sigma^2$,
then our covariance function simply becomes $k(x_i, x_j) \to k(x_i, x_j) + \delta_{ij} \sigma^2$, where $\delta_{ij} = 1$
if $i = j$ and $0$ otherwise.
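In matrix form, the noise term simply adds $\sigma^2$ to the diagonal of the kernel matrix of the observed inputs. The sketch below compares the posterior variance of the latent function at an observed input with and without noise; the helper `latent_variance_at`, the RBF kernel settings, and the noise level are hypothetical choices made for illustration, assuming a zero prior mean.

```python
import numpy as np

def rbf_kernel(x1, x2, amplitude=1.0, lengthscale=2.0):
    sq_dist = (np.asarray(x1)[:, None] - np.asarray(x2)[None, :]) ** 2
    return amplitude ** 2 * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def latent_variance_at(x_query, x_train, noise_std):
    """Posterior variance of the latent f at x_query given noisy observations."""
    # k(x_i, x_j) + delta_ij * sigma^2 for the observed inputs
    K = rbf_kernel(x_train, x_train) + noise_std ** 2 * np.eye(len(x_train))
    k_star = rbf_kernel(x_query, x_train)
    return rbf_kernel(x_query, x_query) - k_star @ np.linalg.solve(K, k_star.T)

x_train = np.array([1.0, 4.0, 6.0])
x_query = x_train[:1]                          # query at an observed input
s2_noise_free = latent_variance_at(x_query, x_train, noise_std=0.0)[0, 0]
s2_noisy = latent_variance_at(x_query, x_train, noise_std=0.5)[0, 0]
```

Without noise the variance at an observed input collapses to zero, whereas with observation noise some uncertainty about the latent function remains even where we have data.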


We have already started getting some intuition about how we can use a Gaussian process
to specify a prior and posterior over solutions, and how the kernel function affects the
properties of these solutions. In the following notebooks, we will precisely show how to
specify a Gaussian process prior, introduce and derive various kernel functions, and then
go through the mechanics of how to automatically learn kernel hyperparameters, and form
a Gaussian process posterior to make predictions. While it takes time and practice to get
used to concepts such as “distributions over functions”, the mechanics of finding
the GP predictive equations are quite simple, making it easy to get the practice needed to
form an intuitive understanding of these concepts.


18.1.1 Summary 


In typical machine learning, we specify a function with some free parameters (such as
a neural network and its weights), and we focus on estimating those parameters, which
may not be interpretable. With a Gaussian process, we instead reason about distributions
over functions directly, which enables us to reason about the high-level properties of the
solutions. These properties are controlled by a covariance function (kernel), which often
has a few highly interpretable hyperparameters. These hyperparameters include the
length-scale, which controls how rapidly the functions vary (how wiggly they are). Another hyperparameter
is the amplitude, which controls the vertical scale over which our functions are varying.
Representing many different functions that can fit the data, and combining them all together
into a predictive distribution, is a distinctive feature of Bayesian methods. Because there
is a greater amount of variability between possible solutions far away from the data, our
uncertainty intuitively grows as we move away from the data.


A Gaussian process represents a distribution over functions by specifying a multivariate 
normal (Gaussian) distribution over all possible function values. It is possible to easily 
manipulate Gaussian distributions to find the distribution of one function value based on 
the values of any set of other values. In other words, if we observe a set of points, then we 
can condition on these points and infer a distribution over what the value of the function 
might look like at any other input. How we model the correlations between these points is 
determined by the covariance function and is what defines the generalization properties of 
the Gaussian process. While it takes time to get used to Gaussian processes, they are easy 
to work with, have many applications, and help us understand and develop other model 
classes, like neural networks. 


18.1.2 Exercises 


1. What is the difference between epistemic uncertainty and observation uncertainty?


2. Besides rate of variation and amplitude, what other properties of functions might we 
want to consider, and what would be real-world examples of functions that have those 
properties? 


3. The RBF covariance function we considered says that covariances (and correlations) 
between observations decrease with their distance in the input space (times, spatial lo- 
cations, etc.). Is this a reasonable assumption? Why or why not? 


4. Is a sum of two Gaussian variables Gaussian? Is a product of two Gaussian variables
Gaussian? If $(a, b)$ have a joint Gaussian distribution, is $a \mid b$ ($a$ given $b$) Gaussian? Is $a$
Gaussian?


5. Repeat the exercise where we observe a data point at $f(x_1) = 1.2$, but now suppose we
additionally observe $f(x_2) = 1.4$. Let $k(x, x_1) = 0.9$, and $k(x, x_2) = 0.8$. Will we be
more or less certain about the value of $f(x)$ than when we had only observed $f(x_1)$?
What are the mean and 95% credible set for our value of $f(x)$ now?


6. Do you think increasing our estimate of observation noise would increase or decrease 
our estimate of the length-scale of the ground truth function? 


7. As we move away from the data, suppose the uncertainty in our predictive distribution
increases to a point, then stops increasing. Why might that happen?


Discussions.


18.2 Gaussian Process Priors 


Understanding Gaussian processes (GPs) is important for reasoning about model construc- 
tion and generalization, and for achieving state-of-the-art performance in a variety of appli- 
cations, including active learning, and hyperparameter tuning in deep learning. GPs are ev- 
erywhere, and it is in our interests to know what they are and how we can use them. 


In this section, we introduce Gaussian process priors over functions. In the next notebook, 
we show how to use these priors to do posterior inference and make predictions. The 
next section can be viewed as “GPs in a nutshell”, quickly giving what you need to apply 
Gaussian processes in practice. 


import numpy as np
from scipy.spatial import distance_matrix
from d2l import torch as d2l

d2l.set_figsize()


18.2.1 Definition 


A Gaussian process is defined as a collection of random variables, any finite number of
which have a joint Gaussian distribution. If a function $f(x)$ is a Gaussian process, with
mean function $m(x)$ and covariance function or kernel $k(x, x')$, $f(x) \sim \mathcal{GP}(m, k)$, then any
collection of function values queried at any collection of input points $x$ (times, spatial
locations, image pixels, etc.) has a joint multivariate Gaussian distribution with mean vector
$\mu$ and covariance matrix $K$: $f(x_1), \dots, f(x_n) \sim \mathcal{N}(\mu, K)$, where $\mu_i = E[f(x_i)] = m(x_i)$
and $K_{ij} = \mathrm{Cov}(f(x_i), f(x_j)) = k(x_i, x_j)$.


This definition may seem abstract and inaccessible, but Gaussian processes are in fact very
simple objects. Any function

$$f(x) = w^{\top} \phi(x) = \langle w, \phi(x) \rangle, \qquad (18.2.1)$$

with $w$ drawn from a Gaussian (normal) distribution, and $\phi$ being any vector of basis
functions, for example $\phi(x) = (1, x, x^2, \dots, x^d)^{\top}$, is a Gaussian process. Moreover, any
Gaussian process $f(x)$ can be expressed in the form of equation (18.2.1). Let’s consider a few
concrete examples, to begin getting acquainted with Gaussian processes, after which we
can appreciate how simple and useful they really are.


18.2.2 A Simple Gaussian Process


Suppose $f(x) = w_0 + w_1 x$, and $w_0, w_1 \sim \mathcal{N}(0, 1)$, with $w_0, w_1, x$ all in one dimension.
We can equivalently write this function as the inner product $f(x) = (w_0, w_1)(1, x)^{\top}$. In
(18.2.1) above, $w = (w_0, w_1)^{\top}$ and $\phi(x) = (1, x)^{\top}$.


For any $x$, $f(x)$ is a sum of two Gaussian random variables. Since Gaussians are closed
under addition, $f(x)$ is also a Gaussian random variable for any $x$. In fact, we can compute
for any particular $x$ that $f(x)$ is $\mathcal{N}(0, 1 + x^2)$. Similarly, the joint distribution for any
collection of function values, $(f(x_1), \dots, f(x_n))$, for any collection of inputs $x_1, \dots, x_n$, is a
multivariate Gaussian distribution. Therefore $f(x)$ is a Gaussian process.
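The claim that $f(x)$ has variance $1 + x^2$ is easy to check empirically. The following sketch draws many independent weight pairs and compares the sample variance of $f(x)$ at one input with the predicted value; the choice $x = 2$, the sample size, and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 2.0
# Each row is one independent draw of (w_0, w_1) from N(0, 1)
w = rng.standard_normal((100_000, 2))
f = w[:, 0] + w[:, 1] * x           # f(x) = w_0 + w_1 x for every draw
empirical_mean = f.mean()
empirical_var = f.var()
predicted_var = 1 + x ** 2          # variance of N(0, 1 + x^2)
```

With this many draws the sample mean should be close to zero and the sample variance close to $1 + x^2 = 5$.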


In short, $f(x)$ is a random function, or a distribution over functions. We can gain some
insights into this distribution by repeatedly sampling values for $w_0, w_1$, and visualizing the
corresponding functions $f(x)$, which are straight lines with different slopes and intercepts,
as follows:


def lin_func(x, n_sample):
    # Each row of preds is one sampled line f(x) = w_0 + w_1 x
    preds = np.zeros((n_sample, x.shape[0]))
    for ii in range(n_sample):
        w = np.random.normal(0, 1, 2)
        y = w[0] + w[1] * x
        preds[ii, :] = y
    return preds

x_points = np.linspace(-5, 5, 50)
outs = lin_func(x_points, 10)
# Two standard deviations of N(0, 1 + x^2) on either side of the mean
lw_bd = -2 * np.sqrt((1 + x_points ** 2))
up_bd = 2 * np.sqrt((1 + x_points ** 2))

d2l.plt.fill_between(x_points, lw_bd, up_bd, alpha=0.25)
d2l.plt.plot(x_points, np.zeros(len(x_points)), linewidth=4, color='black')
d2l.plt.plot(x_points, outs.T)
d2l.plt.xlabel("x", fontsize=20)
d2l.plt.ylabel("f(x)", fontsize=20)
d2l.plt.show()


If $w_0$ and $w_1$ are instead drawn from $\mathcal{N}(0, \alpha^2)$, how do you imagine varying $\alpha$ affects the
distribution over functions?


18.2.3 From Weight Space to Function Space


In the plot above, we saw how a distribution over parameters in a model induces a
distribution over functions. While we often have ideas about the functions we want to model
(whether they’re smooth, periodic, quickly varying, etc.), it is relatively tedious to reason
about the parameters, which are largely uninterpretable. Fortunately, Gaussian processes
provide an easy mechanism to reason directly about functions. Since a Gaussian
distribution is entirely defined by its first two moments, its mean and covariance matrix, a Gaussian
process by extension is defined by its mean function and covariance function.


In the above example, the mean function is

$$m(x) = E[f(x)] = E[w_0 + w_1 x] = E[w_0] + E[w_1] x = 0 + 0 = 0. \qquad (18.2.2)$$

Similarly, the covariance function is

$$k(x, x') = \mathrm{Cov}(f(x), f(x')) = E[f(x) f(x')] - E[f(x)] E[f(x')] = E[w_0^2 + w_0 w_1 x' + w_1 w_0 x + w_1^2 x x'] = 1 + x x'. \qquad (18.2.3)$$


Our distribution over functions can now be directly specified and sampled from, without 
needing to sample from the distribution over parameters. For example, to draw from f(x), 
we can simply form our multivariate Gaussian distribution associated with any collection of 
x we want to query, and sample from it directly. We will begin to see just how advantageous 
this formulation will be. 
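As a concrete sketch of sampling in function space, the code below builds the covariance matrix from $k(x, x') = 1 + x x'$ on a grid of inputs and draws functions from the resulting multivariate Gaussian, without ever sampling weights. The helper name `linear_kernel`, the Cholesky-based sampler, and the small `jitter` added to the diagonal (needed because this covariance matrix has rank two) are implementation assumptions.

```python
import numpy as np

def linear_kernel(x1, x2):
    # k(x, x') = 1 + x x', the covariance function in (18.2.3)
    return 1.0 + np.outer(x1, x2)

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 50)
K = linear_kernel(x, x)

# K is only rank 2, so add a tiny jitter before the Cholesky factorization
jitter = 1e-8 * np.eye(len(x))
L = np.linalg.cholesky(K + jitter)

# Each row is one draw from N(0, K): a function evaluated on the grid x
samples = (L @ rng.standard_normal((len(x), 10))).T
```

Each sampled row is, up to the tiny jitter, a straight line, just like the functions obtained earlier by sampling $w_0$ and $w_1$ directly.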


First, we note that essentially the same derivation for the simple straight line model above
can be applied to find the mean and covariance function for any model of the form $f(x) =
w^{\top} \phi(x)$, with $w \sim \mathcal{N}(u, S)$. In this case, the mean function is $m(x) = u^{\top} \phi(x)$, and the
covariance function is $k(x, x') = \phi(x)^{\top} S \phi(x')$. Since $\phi(x)$ can represent a vector of any
non-linear basis functions, we are considering a very general model class, including models
with even an infinite number of parameters.
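The general formulas $m(x) = u^{\top} \phi(x)$ and $k(x, x') = \phi(x)^{\top} S \phi(x')$ can be evaluated for any finite basis. The sketch below uses a polynomial basis as an example; the function names `poly_features` and `weight_space_gp` are hypothetical, and the final lines check that the straight-line model is recovered as a special case.

```python
import numpy as np

def poly_features(x, degree):
    """Rows are phi(x) = (1, x, ..., x^degree) for each input."""
    return np.vander(np.asarray(x, dtype=float), degree + 1, increasing=True)

def weight_space_gp(x, u, S, degree):
    """Mean vector and covariance matrix induced by w ~ N(u, S)."""
    Phi = poly_features(x, degree)
    mean = Phi @ u                   # m(x_i) = u^T phi(x_i)
    K = Phi @ S @ Phi.T              # k(x_i, x_j) = phi(x_i)^T S phi(x_j)
    return mean, K

x = np.array([-1.0, 0.5, 2.0])
# Straight-line model from before: phi(x) = (1, x), u = 0, S = I
mean, K = weight_space_gp(x, u=np.zeros(2), S=np.eye(2), degree=1)
```

For this special case the mean is zero and the covariance matrix has entries $1 + x_i x_j$, matching (18.2.2) and (18.2.3).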


18.2.4 The Radial Basis Function (RBF) Kernel 


The radial basis function (RBF) kernel is the most popular covariance function for
Gaussian processes, and kernel machines in general. This kernel has the form $k_{\text{RBF}}(x, x') =
a^2 \exp\left(-\frac{1}{2\ell^2} \|x - x'\|^2\right)$, where $a$ is an amplitude parameter, and $\ell$ is a length-scale
hyperparameter.


Let’s derive this kernel starting from weight space. Consider the function

$$f(x) = \sum_{i=1}^{J} w_i \phi_i(x), \quad w_i \sim \mathcal{N}\left(0, \frac{\sigma^2}{J}\right), \quad \phi_i(x) = \exp\left(-\frac{(x - c_i)^2}{2\ell^2}\right). \qquad (18.2.4)$$

$f(x)$ is a sum of radial basis functions, with width $\ell$, centred at the points $c_i$, as shown in
the following figure.

We can recognize $f(x)$ as having the form $w^{\top} \phi(x)$, where $w = (w_1, \dots, w_J)^{\top}$ and $\phi(x)$
is a vector containing each of the radial basis functions. The covariance function of this
Gaussian process is then

$$k(x, x') = \frac{\sigma^2}{J} \sum_{i=1}^{J} \phi_i(x) \phi_i(x'). \qquad (18.2.5)$$

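Now let’s consider what happens as we take the number of parameters (and basis functions)
to infinity. Let $c_J = \log J$, $c_1 = -\log J$, and $c_{i+1} - c_i = \Delta c = 2 \frac{\log J}{J}$, and $J \to \infty$. The
covariance function becomes the Riemann sum:


$$k(x, x') = \lim_{J \to \infty} \frac{\sigma^2}{J} \sum_{i=1}^{J} \phi_i(x) \phi_i(x') = \int_{c_0}^{c_\infty} \phi_c(x) \phi_c(x') \, dc. \tag{18.2.6}$$


By setting $c_0 = -\infty$ and $c_\infty = \infty$, we spread the infinitely many basis functions across the
whole real line, each a distance $\Delta c \to 0$ apart:


$$k(x, x') = \int_{-\infty}^{\infty} \exp\left(-\frac{(x - c)^2}{2\ell^2}\right) \exp\left(-\frac{(x' - c)^2}{2\ell^2}\right) dc = \sqrt{\pi} \ell \sigma^2 \exp\left(-\frac{(x - x')^2}{2(\sqrt{2}\ell)^2}\right) \propto k_{\mathrm{RBF}}(x, x'). \tag{18.2.7}$$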
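As a sanity check on the integral above, the following sketch (an illustration added here, with $\sigma^2 = 1$ and arbitrarily chosen values for $\ell$, $x$, and $x'$) approximates the integral over centres $c$ with a dense Riemann sum and compares it with the closed form $\sqrt{\pi} \ell \exp\left(-\frac{(x - x')^2}{4\ell^2}\right)$.

```python
import numpy as np

ell = 0.7              # width of each basis function (assumed value)
x, x_prime = 0.3, 1.5  # two arbitrary inputs

# Dense grid of centres c standing in for the infinitely many basis functions.
c = np.linspace(-20.0, 20.0, 200_001)
dc = c[1] - c[0]

phi_x = np.exp(-(x - c) ** 2 / (2 * ell ** 2))
phi_xp = np.exp(-(x_prime - c) ** 2 / (2 * ell ** 2))

# Riemann sum approximating the integral over centres.
k_numeric = np.sum(phi_x * phi_xp) * dc

# Closed form with sigma^2 = 1: an RBF kernel with lengthscale sqrt(2) * ell.
k_closed = np.sqrt(np.pi) * ell * np.exp(-(x - x_prime) ** 2 / (4 * ell ** 2))

print(k_numeric, k_closed)
```

The two numbers agree to many decimal places, confirming that the infinite basis-function model collapses to an RBF kernel whose lengthscale is $\sqrt{2}\ell$.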


It is worth taking a moment to absorb what we have done here. By moving into the function 
space representation, we have derived how to represent a model with an infinite number of 
parameters, using a finite amount of computation. A Gaussian process with an RBF kernel 
is a universal approximator, capable of representing any continuous function to arbitrary 
precision. We can intuitively see why from the above derivation. We can collapse each
radial basis function to a point mass by taking $\ell \to 0$, and give each point mass any height we
wish.


So a Gaussian process with an RBF kernel is a model with an infinite number of param- 
eters and much more flexibility than any finite neural network. Perhaps all the fuss about 
overparametrized neural networks is misplaced. As we will see, GPs with RBF kernels 
do not overfit, and in fact provide especially compelling generalization performance on 
small datasets. Moreover, the examples in (Zhang et al., 2021), such as the ability to fit
images with random labels perfectly, but still generalize well on structured problems, can
be perfectly reproduced using Gaussian processes (Wilson and Izmailov, 2020). Neural
networks are not as distinct as we make them out to be.


We can build further intuition about Gaussian processes with RBF kernels, and hyperpa- 
rameters such as length-scale, by sampling directly from the distribution over functions. 
As before, this involves a simple procedure: 


1. Choose the input points at which we want to query the GP: $x_1, \dots, x_n$.
2. Evaluate $m(x_i)$, $i = 1, \dots, n$, and $k(x_i, x_j)$ for $i, j = 1, \dots, n$ to respectively form the
mean vector and covariance matrix $\boldsymbol{\mu}$ and $K$, where $(f(x_1), \dots, f(x_n)) \sim \mathcal{N}(\boldsymbol{\mu}, K)$.


3. Sample from this multivariate Gaussian distribution to obtain the sample function val- 
ues. 


4. Sample more times to visualize more sample functions queried at those points. 


We illustrate this process in the figure below.


def rbfkernel(x1, x2, ls=4.):  #@save
    dist = distance_matrix(np.expand_dims(x1, 1), np.expand_dims(x2, 1))
    return np.exp(-(1. / ls / 2) * (dist ** 2))


x_points = np.linspace(0, 5, 50)
meanvec = np.zeros(len(x_points))
covmat = rbfkernel(x_points, x_points, 1)


prior_samples = np.random.multivariate_normal(meanvec, covmat, size=5)
d2l.plt.plot(x_points, prior_samples.T, alpha=0.5)
d2l.plt.show()


18.2.5 The Neural Network Kernel 


Research on Gaussian processes in machine learning was triggered by research on neu- 
ral networks. Radford Neal was pursuing ever larger Bayesian neural networks, ultimately 
showing in 1994 (later published in 1996, as it was one of the most infamous NeurIPS 
rejections) that such networks with an infinite number of hidden units become Gaussian 
processes with particular kernel functions (Neal, 1996). Interest in this derivation has re- 
surfaced, with ideas like the neural tangent kernel being used to investigate the generaliza- 
tion properties of neural networks (Matthews et al., 2018) (Novak et al., 2018). We can 
derive the neural network kernel as follows. 


Consider a neural network function $f(x)$ with one hidden layer:


$$f(x) = b + \sum_{i=1}^{J} v_i h(x; \mathbf{u}_i). \tag{18.2.8}$$


$b$ is a bias, $v_i$ are the hidden to output weights, $h$ is any bounded hidden unit transfer
function, $\mathbf{u}_i$ are the input to hidden weights, and $J$ is the number of hidden units. Let $b$ and
$v_i$ be independent with zero mean and variances $\sigma_b^2$ and $\sigma_v^2 / J$, respectively, and let the $\mathbf{u}_i$
have independent identical distributions. We can then use the central limit theorem to show
that any collection of function values $f(x_1), \dots, f(x_n)$ has a joint multivariate Gaussian
distribution.


The mean and covariance function of the corresponding Gaussian process are:


$$m(x) = \mathbb{E}[f(x)] = 0 \tag{18.2.9}$$


$$k(x, x') = \mathrm{cov}[f(x), f(x')] = \mathbb{E}[f(x) f(x')] = \sigma_b^2 + \frac{1}{J} \sum_{i=1}^{J} \sigma_v^2 \mathbb{E}[h_i(x; \mathbf{u}_i) h_i(x'; \mathbf{u}_i)] \tag{18.2.10}$$


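In some cases, we can essentially evaluate this covariance function in closed form. Let
$h(x; \mathbf{u}) = \mathrm{erf}(u_0 + \sum_{j=1}^{P} u_j x_j)$, where $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2} dt$, and $\mathbf{u} \sim \mathcal{N}(0, \Sigma)$. Then,
writing $\tilde{x} = (1, x_1, \dots, x_P)^{\top}$ for the input augmented with a leading one,


$$k(x, x') = \frac{2}{\pi} \sin^{-1}\left(\frac{2 \tilde{x}^{\top} \Sigma \tilde{x}'}{\sqrt{(1 + 2 \tilde{x}^{\top} \Sigma \tilde{x})(1 + 2 \tilde{x}'^{\top} \Sigma \tilde{x}')}}\right).$$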


The RBF kernel is stationary, meaning that it is translation invariant, and therefore can 
be written as a function of $\tau = x - x'$. Intuitively, stationarity means that the high-level
properties of the function, such as rate of variation, do not change as we move in input 
space. The neural network kernel, however, is non-stationary. Below, we show sample 
functions from a Gaussian process with this kernel. We can see that the function looks 
qualitatively different near the origin. 
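Sample functions from this kernel can be drawn with the same procedure we used for the RBF kernel. The sketch below is an illustration under stated assumptions rather than the original notebook code: the helper name `nn_kernel`, the prior covariance `Sigma` over $(u_0, u_1)$, and the choice $\sigma_b^2 = 0$, $\sigma_v^2 = 1$ are all ours. It implements the arcsine closed form for one-dimensional inputs and samples through a Cholesky factor with a small jitter.

```python
import numpy as np


def nn_kernel(x1, x2, Sigma):
    """Arcsine (erf neural network) kernel for 1-D inputs; Sigma is 2x2."""
    # Augment each input with a leading 1 so that u_0 acts as a bias.
    a1 = np.stack([np.ones_like(x1), x1], axis=1)  # (n, 2)
    a2 = np.stack([np.ones_like(x2), x2], axis=1)  # (m, 2)
    cross = 2.0 * a1 @ Sigma @ a2.T
    d1 = 1.0 + 2.0 * np.einsum('ij,jk,ik->i', a1, Sigma, a1)
    d2 = 1.0 + 2.0 * np.einsum('ij,jk,ik->i', a2, Sigma, a2)
    return (2.0 / np.pi) * np.arcsin(cross / np.sqrt(np.outer(d1, d2)))


x_points = np.linspace(-5, 5, 100)
Sigma = np.diag([1.0, 10.0])  # assumed prior covariance for (u_0, u_1)
K = nn_kernel(x_points, x_points, Sigma)

rng = np.random.default_rng(0)
# Small jitter keeps the Cholesky factorization numerically stable.
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x_points)))
nn_samples = (L @ rng.standard_normal((len(x_points), 5))).T  # 5 sample functions
```

Plotting `nn_samples.T` against `x_points` shows functions that vary quickly near the origin and flatten out far from it, which is the non-stationary behaviour described above.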


18.2.6 Summary 


The first step in performing Bayesian inference involves specifying a prior. Gaussian pro- 
cesses can be used to specify a whole prior over functions. Starting from a traditional 
“weight space” view of modelling, we can induce a prior over functions by starting with the 
functional form of a model, and introducing a distribution over its parameters. We can al- 
ternatively specify a prior distribution directly in function space, with properties controlled 
by a kernel. The function-space approach has many advantages. We can build models that
actually correspond to an infinite number of parameters, but use a finite amount of com- 
putation! Moreover, while these models have a great amount of flexibility, they also make 
strong assumptions about what types of functions are a priori likely, leading to relatively 
good generalization on small datasets. 


The assumptions of models in function space are intuitively controlled by kernels, which of- 
ten encode higher level properties of functions, such as smoothness and periodicity. Many 
kernels are stationary, meaning that they are translation invariant. Functions drawn from 
a Gaussian process with a stationary kernel have roughly the same high-level properties 
(such as rate of variation) regardless of where we look in the input space. 


Gaussian processes are a relatively general model class, containing many examples of mod- 
els we are already familiar with, including polynomials, Fourier series, and so on, as long 
as we have a Gaussian prior over the parameters. They also include neural networks with 
an infinite number of parameters, even without Gaussian distributions over the parameters. 
This connection, discovered by Radford Neal, triggered machine learning researchers to 
move away from neural networks, and towards Gaussian processes. 


18.2.7 Exercises 


1. Draw sample prior functions from a GP with an Ornstein-Uhlenbeck (OU) kernel,
$k_{\mathrm{OU}}(x, x') = \exp\left(-\frac{1}{2\ell} \|x - x'\|\right)$. If you fix the lengthscale $\ell$ to be the same, how do these functions
look different than sample functions from a GP with an RBF kernel?


2. How does changing the amplitude $a^2$ of the RBF kernel affect the distribution over
functions?


3. Suppose we form $u(x) = f(x) + 2g(x)$, where $f(x) \sim \mathcal{GP}(m_1, k_1)$ and $g(x) \sim \mathcal{GP}(m_2, k_2)$.
Is $u(x)$ a Gaussian process, and if so, what is its mean and covariance function?


4. Suppose we form $g(x) = a(x) f(x)$, where $f(x) \sim \mathcal{GP}(0, k)$ and $a(x) = x^2$. Is $g(x)$
a Gaussian process, and if so, what is its mean and covariance function? What is the
effect of $a(x)$? What do sample functions drawn from $g(x)$ look like?


5. Suppose we form $u(x) = f(x) g(x)$, where $f(x) \sim \mathcal{GP}(m_1, k_1)$ and $g(x) \sim \mathcal{GP}(m_2, k_2)$.
Is $u(x)$ a Gaussian process, and if so, what is its mean and covariance function?




18.3 Gaussian Process Inference 


In this section, we will show how to perform posterior inference and make predictions using 
the GP priors we introduced in the last section. We will start with regression, where we can 
perform inference in closed form. This is a “GPs in a nutshell” section to quickly get up 
and running with Gaussian processes in practice. We’ll start coding all the basic operations
from scratch, and then introduce GPyTorch, which will make working with state-of-the-art
Gaussian processes and integration with deep neural networks much more convenient.
We will consider these more advanced topics in depth in the next section. In that section,
we will also consider settings where approximate inference is required: classification,
point processes, or any non-Gaussian likelihoods.


18.3.1 Posterior Inference for Regression 


An observation model relates the function we want to learn, f(x), to our observations 
y(x), both indexed by some input x. In classification, x could be the pixels of an image, 
and y could be the associated class label. In regression, y typically represents a continuous 
output, such as a land surface temperature, a sea-level, a CO2 concentration, etc. 


In regression, we often assume the outputs are given by a latent noise-free function $f(x)$
plus i.i.d. Gaussian noise $\epsilon(x)$:


$$y(x) = f(x) + \epsilon(x), \tag{18.3.1}$$


with $\epsilon(x) \sim \mathcal{N}(0, \sigma^2)$. Let $\mathbf{y} = y(X) = (y(x_1), \dots, y(x_n))^{\top}$ be a vector of our training
observations, and $\mathbf{f} = (f(x_1), \dots, f(x_n))^{\top}$ be a vector of the latent noise-free function
values, queried at the training inputs $X = \{x_1, \dots, x_n\}$.


We will assume $f(x) \sim \mathcal{GP}(m, k)$, which means that any collection of function values $\mathbf{f}$
has a joint multivariate Gaussian distribution, with mean vector $\mu_i = m(x_i)$ and covariance
matrix $K_{ij} = k(x_i, x_j)$. The RBF kernel $k(x_i, x_j) = a^2 \exp\left(-\frac{1}{2\ell^2} \|x_i - x_j\|^2\right)$ would be a
standard choice of covariance function. For notational simplicity, we will assume the mean
function $m(x) = 0$; our derivations can easily be generalized later on.


Suppose we want to make predictions at a set of inputs


$$X_* = x_{*1}, x_{*2}, \dots, x_{*m}. \tag{18.3.2}$$


Then we want to find $p(\mathbf{f}_* | \mathbf{y}, X)$. In the regression setting, we can conveniently
find this distribution by using Gaussian identities, after finding the joint distribution over
$\mathbf{f}_* = f(X_*)$ and $\mathbf{y}$.


If we evaluate equation (18.3.1) at the training inputs $X$, we have $\mathbf{y} = \mathbf{f} + \boldsymbol{\epsilon}$. By the
definition of a Gaussian process (see last section), $\mathbf{f} \sim \mathcal{N}(0, K(X, X))$ where $K(X, X)$ is
an $n \times n$ matrix formed by evaluating our covariance function (aka kernel) at all possible
pairs of inputs $x_i, x_j \in X$. $\boldsymbol{\epsilon}$ is simply a vector comprised of i.i.d. samples from $\mathcal{N}(0, \sigma^2)$
and thus has distribution $\mathcal{N}(0, \sigma^2 I)$. $\mathbf{y}$ is therefore a sum of two independent multivariate
Gaussian variables, and thus has distribution $\mathcal{N}(0, K(X, X) + \sigma^2 I)$. One can also show
that $\mathrm{cov}(\mathbf{f}_*, \mathbf{y}) = \mathrm{cov}(\mathbf{y}, \mathbf{f}_*)^{\top} = K(X_*, X)$ where $K(X_*, X)$ is an $m \times n$ matrix formed by
evaluating the kernel at all pairs of test and training inputs.


$$\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X, X) + \sigma^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right) \tag{18.3.3}$$


We can then use standard Gaussian identities to find the conditional distribution from the
joint distribution (see, e.g., Bishop Chapter 2), $\mathbf{f}_* | \mathbf{y}, X, X_* \sim \mathcal{N}(m_*, S_*)$, where
$m_* = K(X_*, X)[K(X, X) + \sigma^2 I]^{-1} \mathbf{y}$, and $S_* = K(X_*, X_*) - K(X_*, X)[K(X, X) + \sigma^2 I]^{-1} K(X, X_*)$.


Typically, we do not need to make use of the full predictive covariance matrix $S_*$, and
instead use the diagonal of $S_*$ for uncertainty about each prediction. Often for this reason we
write the predictive distribution for a single test point $x_*$, rather than a collection of test
points.
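The conditioning step can be written in a few lines. The following is a minimal sketch, assuming one-dimensional inputs and an RBF kernel parameterized by amplitude $a$ and lengthscale $\ell$; the helper names `rbf` and `gp_posterior` and the toy data are hypothetical and are not the functions used later in this section. It solves linear systems rather than forming the inverse explicitly.

```python
import numpy as np


def rbf(xa, xb, a=1.0, ell=1.0):
    """RBF kernel a^2 exp(-(x - x')^2 / (2 ell^2)) for 1-D inputs."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return a ** 2 * np.exp(-sq_dist / (2 * ell ** 2))


def gp_posterior(X, y, X_star, sigma, a=1.0, ell=1.0):
    """Return m_* and S_* for a zero-mean GP with Gaussian noise."""
    K = rbf(X, X, a, ell) + sigma ** 2 * np.eye(len(X))
    K_star = rbf(X_star, X, a, ell)            # m x n
    K_star_star = rbf(X_star, X_star, a, ell)  # m x m
    # Solve linear systems instead of forming the inverse explicitly.
    m_star = K_star @ np.linalg.solve(K, y)
    S_star = K_star_star - K_star @ np.linalg.solve(K, K_star.T)
    return m_star, S_star


X = np.array([0.0, 1.0, 2.0])
y = np.sin(X)
# The last test input is far from the data, the first three coincide with it.
X_star = np.array([0.0, 1.0, 2.0, 10.0])
m_star, S_star = gp_posterior(X, y, X_star, sigma=1e-3)
print(m_star)
print(np.diag(S_star))
```

With almost no observation noise, the posterior mean reproduces the targets at the training inputs with near-zero variance, while far from the data the mean returns to the prior mean of zero and the variance returns to the prior variance $a^2 = 1$.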


The kernel matrix has parameters $\theta$ that we also wish to estimate, such as the amplitude $a$ and
lengthscale $\ell$ of the RBF kernel above. For these purposes we use the marginal likelihood,
$p(\mathbf{y} | \theta, X)$, which we already derived in working out the marginal distributions to find the
joint distribution over $\mathbf{y}, \mathbf{f}_*$. As we will see, the marginal likelihood compartmentalizes into
model fit and model complexity terms, and automatically encodes a notion of Occam’s razor 
for learning hyperparameters. For a full discussion, see MacKay Ch. 28 (MacKay, 2003), 
and Rasmussen and Williams Ch. 5 (Rasmussen and Williams, 2006). 


import math
import os
import gpytorch
import matplotlib.pyplot as plt
import numpy as np
import torch
from scipy import optimize
from scipy.spatial import distance_matrix
from d2l import torch as d2l


d2l.set_figsize()


18.3.2 Equations for Making Predictions and Learning Kernel
Hyperparameters in GP Regression


We list here the equations you will use for learning hyperparameters and making predictions
in Gaussian process regression. Again, we assume a vector of regression targets $\mathbf{y}$, indexed
by inputs $X = \{x_1, \dots, x_n\}$, and we wish to make a prediction at a test input $x_*$. We assume
i.i.d. additive zero-mean Gaussian noise with variance $\sigma^2$. We use a Gaussian process
prior $f(x) \sim \mathcal{GP}(m, k)$ for the latent noise-free function, with mean function $m$ and kernel
function $k$. The kernel itself has parameters $\theta$ that we want to learn. For example, if we
use an RBF kernel, $k(x_i, x_j) = a^2 \exp\left(-\frac{1}{2\ell^2} \|x_i - x_j\|^2\right)$, we want to learn $\theta = \{a^2, \ell^2\}$. Let
$K(X, X)$ represent an $n \times n$ matrix corresponding to evaluating the kernel for all possible
pairs of $n$ training inputs. Let $K(x_*, X)$ represent a $1 \times n$ vector formed by evaluating
$k(x_*, x_i)$, $i = 1, \dots, n$. Let $\mu$ be a mean vector formed by evaluating the mean function
$m(x)$ at every training point $x$.


Typically in working with Gaussian processes, we follow a two-step procedure.

1. Learn kernel hyperparameters $\hat{\theta}$ by maximizing the marginal likelihood with respect to
these hyperparameters.
2. Use the predictive mean as a point predictor, and 2 times the predictive standard deviation
to form a 95% credible set, conditioning on these learned hyperparameters $\hat{\theta}$.


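The log marginal likelihood is simply a log Gaussian density, which has the form:


$$\log p(\mathbf{y} | \theta, X) = -\frac{1}{2} \mathbf{y}^{\top} [K_\theta(X, X) + \sigma^2 I]^{-1} \mathbf{y} - \frac{1}{2} \log |K_\theta(X, X) + \sigma^2 I| + c \tag{18.3.4}$$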


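The predictive distribution has the form:


$$p(y_* | x_*, \mathbf{y}, \theta) = \mathcal{N}(a_*, v_*) \tag{18.3.5}$$
$$a_* = k_\theta(x_*, X)[K_\theta(X, X) + \sigma^2 I]^{-1} (\mathbf{y} - \mu) + \mu \tag{18.3.6}$$
$$v_* = k_\theta(x_*, x_*) - K_\theta(x_*, X)[K_\theta(X, X) + \sigma^2 I]^{-1} k_\theta(X, x_*) \tag{18.3.7}$$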
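The log marginal likelihood above is the log density of $\mathbf{y}$ under $\mathcal{N}(0, K_\theta(X, X) + \sigma^2 I)$ when the mean is zero. The sketch below is one way to evaluate it, assuming a zero mean, one-dimensional inputs, and hypothetical helpers `rbf` and `log_marginal_likelihood`. It uses a Cholesky factorization, so the log determinant comes from the diagonal of the factor, and it checks the result against SciPy's multivariate normal density.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.stats import multivariate_normal


def rbf(xa, xb, a=1.0, ell=1.0):
    """RBF kernel a^2 exp(-(x - x')^2 / (2 ell^2)) for 1-D inputs."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return a ** 2 * np.exp(-sq_dist / (2 * ell ** 2))


def log_marginal_likelihood(X, y, sigma, a=1.0, ell=1.0):
    n = len(X)
    Ky = rbf(X, X, a, ell) + sigma ** 2 * np.eye(n)
    c, lower = cho_factor(Ky, lower=True)
    alpha = cho_solve((c, lower), y)           # [K + sigma^2 I]^{-1} y
    data_fit = -0.5 * y @ alpha                # model fit term
    complexity = -np.sum(np.log(np.diag(c)))   # -0.5 * log|K + sigma^2 I|
    const = -0.5 * n * np.log(2 * np.pi)
    return data_fit + complexity + const


rng = np.random.default_rng(0)
X = np.linspace(0, 5, 20)
y = np.sin(X) + 0.25 * rng.standard_normal(20)

lml = log_marginal_likelihood(X, y, sigma=0.25, ell=0.5)
reference = multivariate_normal(
    mean=np.zeros(20), cov=rbf(X, X, ell=0.5) + 0.25 ** 2 * np.eye(20)).logpdf(y)
print(lml, reference)
```

The constant $c$ in the equation corresponds to `const` here, namely $-\frac{n}{2} \log 2\pi$.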


18.3.3 Interpreting Equations for Learning and Predictions 


There are some key points to note about the predictive distributions for Gaussian pro- 
cesses: 


• Despite the flexibility of the model class, it is possible to do exact Bayesian inference for
GP regression in closed form. Aside from learning the kernel hyperparameters, there
is no training. We can write down exactly what equations we want to use to make
predictions. Gaussian processes are relatively exceptional in this respect, and it has
greatly contributed to their convenience, versatility, and continued popularity.


• The predictive mean $a_*$ is a linear combination of the training targets $\mathbf{y}$, weighted by the
kernel $k_\theta(x_*, X)[K_\theta(X, X) + \sigma^2 I]^{-1}$. As we will see, the kernel (and its hyperparameters)
thus plays a crucial role in the generalization properties of the model.


• The predictive mean explicitly depends on the target values $\mathbf{y}$ but the predictive variance
does not. The predictive uncertainty instead grows as the test input $x_*$ moves away
from the target locations $X$, as governed by the kernel function. However, uncertainty
will implicitly depend on the values of the targets $\mathbf{y}$ through the kernel hyperparameters
$\theta$, which are learned from the data.


• The marginal likelihood compartmentalizes into model fit and model complexity (log
determinant) terms. The marginal likelihood tends to select for hyperparameters that
provide the simplest fits that are still consistent with the data.


• The key computational bottlenecks come from solving a linear system and computing
a log determinant over an $n \times n$ symmetric positive definite matrix $K(X, X)$ for $n$
training points. Naively, these operations each incur $\mathcal{O}(n^3)$ computations, as well as
$\mathcal{O}(n^2)$ storage for the entries of the kernel (covariance) matrix, often starting with a
Cholesky decomposition. Historically, these bottlenecks have limited GPs to problems
with fewer than about 10,000 training points, and have given GPs a reputation for
“being slow” that has been inaccurate now for almost a decade. In advanced topics,
we will discuss how GPs can be scaled to problems with millions of points.


• For popular choices of kernel functions, $K(X, X)$ is often close to singular, which can
cause numerical issues when performing Cholesky decompositions or other operations
intended to solve linear systems. Fortunately, in regression we are often working
with $K_\theta(X, X) + \sigma^2 I$, such that the noise variance $\sigma^2$ gets added to the diagonal of
$K(X, X)$, significantly improving its conditioning. If the noise variance is small, or
we are doing noise free regression, it is common practice to add a small amount of
“jitter” to the diagonal, on the order of $10^{-6}$, to improve conditioning.
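The effect of the noise variance and of jitter on conditioning is easy to see numerically. The following sketch is an illustration with assumed values (50 evenly spaced inputs, an RBF kernel with $a = \ell = 1$, and $\sigma = 0.25$); it compares the condition number of the bare kernel matrix with the condition numbers after adding $\sigma^2 I$ or a jitter of $10^{-6}$ to the diagonal.

```python
import numpy as np

x = np.linspace(0, 5, 50)
# RBF kernel matrix with a = ell = 1 (assumed values for illustration).
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)

cond_K = np.linalg.cond(K)                                # bare kernel matrix
cond_noise = np.linalg.cond(K + 0.25 ** 2 * np.eye(50))   # K + sigma^2 I
cond_jitter = np.linalg.cond(K + 1e-6 * np.eye(50))       # K + jitter

print(f'K: {cond_K:.2e}, K + sigma^2 I: {cond_noise:.2e}, '
      f'K + jitter: {cond_jitter:.2e}')
```

The bare kernel matrix is singular to machine precision, whereas either diagonal term bounds the smallest eigenvalue away from zero, which is what makes the Cholesky factorization reliable.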


18.3.4 Worked Example from Scratch 


Let’s create some regression data, and then fit the data with a GP, implementing every step
from scratch. We’ll sample data from


$$y(x) = \sin(x) + \frac{1}{2} \sin(4x) + \epsilon, \tag{18.3.8}$$


with $\epsilon \sim \mathcal{N}(0, \sigma^2)$. The noise free function we wish to find is $f(x) = \sin(x) + \frac{1}{2} \sin(4x)$.
We’ll start by using a noise standard deviation $\sigma = 0.25$.


def data_maker1(x, sig):
    return np.sin(x) + 0.5 * np.sin(4 * x) + np.random.randn(x.shape[0]) * sig


sig = 0.25
train_x, test_x = np.linspace(0, 5, 50), np.linspace(0, 5, 500)
train_y, test_y = data_maker1(train_x, sig=sig), data_maker1(test_x, sig=0.)


d2l.plt.scatter(train_x, train_y)
d2l.plt.plot(test_x, test_y)
d2l.plt.xlabel("x", fontsize=20)
d2l.plt.ylabel("Observations y", fontsize=20)
d2l.plt.show()


Here we see the noisy observations as circles, and the noise-free function in blue that we 
wish to find. 


Now, let’s specify a GP prior over the latent noise-free function, $f(x) \sim \mathcal{GP}(m, k)$. We’ll
use a mean function $m(x) = 0$, and an RBF covariance function (kernel)


$$k(x_i, x_j) = a^2 \exp\left(-\frac{1}{2\ell^2} \|x_i - x_j\|^2\right). \tag{18.3.9}$$


mean = np.zeros(test_x.shape[0])
cov = d2l.rbfkernel(test_x, test_x, ls=0.2)


We have started with a length-scale of 0.2. Before we fit the data, it is important to consider 
whether we have specified a reasonable prior. Let’s visualize some sample functions from 
this prior, as well as the 95% credible set (we believe there’s a 95% chance that the true 
function is within this region). 


prior_samples = np.random.multivariate_normal(mean=mean, cov=cov, size=5)
d2l.plt.plot(test_x, prior_samples.T, color='black', alpha=0.5)
d2l.plt.plot(test_x, mean, linewidth=2.)
# Shade mean +/- 2 standard deviations (np.diag(cov) holds the prior variances)
prior_std = np.sqrt(np.diag(cov))
d2l.plt.fill_between(test_x, mean - 2 * prior_std, mean + 2 * prior_std,
                     alpha=0.25)
d2l.plt.show()


Do these samples look reasonable? Are the high-level properties of the functions aligned 
with the type of data we are trying to model? 


Now let’s form the mean and variance of the posterior predictive distribution at any arbitrary
test point $x_*$.


$$\bar{f}_* = K(x, x_*)^{\top} (K(x, x) + \sigma^2 I)^{-1} y \tag{18.3.10}$$


$$V(f_*) = K(x_*, x_*) - K(x, x_*)^{\top} (K(x, x) + \sigma^2 I)^{-1} K(x, x_*) \tag{18.3.11}$$


Before we make predictions, we should learn our kernel hyperparameters $\theta$ and noise
variance $\sigma^2$, as our prior functions looked too quickly varying compared to the data we are
fitting.


In order to learn these parameters, we will maximize the marginal likelihood with respect
to these parameters.


$$\log p(y | X) = \log \int p(y | f, X) p(f | X) \, df \tag{18.3.12}$$


$$\log p(y | X) = -\frac{1}{2} y^{\top} (K(x, x) + \sigma^2 I)^{-1} y - \frac{1}{2} \log |K(x, x) + \sigma^2 I| - \frac{n}{2} \log 2\pi \tag{18.3.13}$$


Perhaps our prior functions were too quickly varying. Let’s guess a length-scale of 0.4.
We’ll also guess a noise standard deviation of 0.5. These are simply hyperparameter
initializations; we will learn these parameters from the marginal likelihood.


ell_est = 0.4
post_sig_est = 0.5


def neg_MLL(pars):
    K = d2l.rbfkernel(train_x, train_x, ls=pars[0])
    kernel_term = -0.5 * train_y @ \
        np.linalg.inv(K + pars[1] ** 2 * np.eye(train_x.shape[0])) @ train_y
    logdet = -0.5 * np.log(np.linalg.det(K + pars[1] ** 2 * \
                                         np.eye(train_x.shape[0])))
    const = -train_x.shape[0] / 2. * np.log(2 * np.pi)

    return -(kernel_term + logdet + const)


learned_hypers = optimize.minimize(neg_MLL,
                                   x0=np.array([ell_est, post_sig_est]),
                                   bounds=((0.01, 10.), (0.01, 10.)))
ell = learned_hypers.x[0]
post_sig_est = learned_hypers.x[1]


In this instance, we learn a length-scale of 0.299, and a noise standard deviation of 0.24.
Note that the learned noise is extremely close to the true noise, which helps indicate that
our GP is very well-specified for this problem.


In general, it is crucial to put careful thought into selecting the kernel and initializing the 
hyperparameters. While marginal likelihood optimization can be relatively robust to ini- 
tialization, it is not immune to poor initializations. Try running the above script with a 
variety of initializations and see what results you find. 
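One simple way to probe sensitivity to initialization is to restart the optimizer from several starting points and keep the best result. The sketch below is self-contained and makes several assumptions of its own: it regenerates synthetic data with a fixed seed, parameterizes the RBF kernel directly by the lengthscale $\ell$ (so the numbers are not directly comparable with `d2l.rbfkernel`), evaluates the negative log marginal likelihood through a Cholesky factor, and uses an arbitrary list of starting points.

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(0)
train_x = np.linspace(0, 5, 50)
train_y = (np.sin(train_x) + 0.5 * np.sin(4 * train_x)
           + 0.25 * rng.standard_normal(50))


def neg_mll(pars):
    """Negative log marginal likelihood for (lengthscale, noise std)."""
    ell, sigma = pars
    sq_dist = (train_x[:, None] - train_x[None, :]) ** 2
    Ky = np.exp(-sq_dist / (2 * ell ** 2)) + sigma ** 2 * np.eye(50)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, train_y))
    return (0.5 * train_y @ alpha + np.sum(np.log(np.diag(L)))
            + 25 * np.log(2 * np.pi))


# Arbitrary (lengthscale, noise std) starting points.
inits = [(0.1, 0.1), (0.4, 0.5), (2.0, 1.0), (5.0, 0.05)]
results = [optimize.minimize(neg_mll, x0=np.array(x0),
                             bounds=((0.01, 10.), (0.01, 10.)))
           for x0 in inits]
best = min(results, key=lambda r: r.fun)

for x0, r in zip(inits, results):
    print(f'init={x0}: ell={r.x[0]:.3f}, sigma={r.x[1]:.3f}, nll={r.fun:.3f}')
```

If different starting points converge to noticeably different objective values, the marginal likelihood surface has several local optima, and the restart with the lowest negative log marginal likelihood is the natural one to keep.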


Now, let’s make predictions with these learned hypers.


K_x_xstar = d2l.rbfkernel(train_x, test_x, ls=ell)
K_x_x = d2l.rbfkernel(train_x, train_x, ls=ell)
K_xstar_xstar = d2l.rbfkernel(test_x, test_x, ls=ell)


post_mean = K_x_xstar.T @ np.linalg.inv((K_x_x + \
    post_sig_est ** 2 * np.eye(train_x.shape[0]))) @ train_y
post_cov = K_xstar_xstar - K_x_xstar.T @ np.linalg.inv((K_x_x + \
    post_sig_est ** 2 * np.eye(train_x.shape[0]))) @ K_x_xstar


lw_bd = post_mean - 2 * np.sqrt(np.diag(post_cov))
up_bd = post_mean + 2 * np.sqrt(np.diag(post_cov))


d2l.plt.scatter(train_x, train_y)
d2l.plt.plot(test_x, test_y, linewidth=2.)
d2l.plt.plot(test_x, post_mean, linewidth=2.)
d2l.plt.fill_between(test_x, lw_bd, up_bd, alpha=0.25)
d2l.plt.legend(['Observed Data', 'True Function', 'Predictive Mean',
                '95% Set on True Func'])
d2l.plt.show()




We see the posterior mean in orange almost perfectly matches the true noise free function! 
Note that the 95% credible set we are showing is for the latent noise free (true) function, 
and not the data points. We see that this credible set entirely contains the true function, and 
does not seem overly wide or narrow. We would not want nor expect it to contain the data 
points. If we wish to have a credible set for the observations, we should compute 


lw_bd_observed = post_mean - 2 * np.sqrt(np.diag(post_cov) + post_sig_est ** 2)
up_bd_observed = post_mean + 2 * np.sqrt(np.diag(post_cov) + post_sig_est ** 2)


There are two sources of uncertainty, epistemic uncertainty, representing reducible uncer- 
tainty, and aleatoric or irreducible uncertainty. The epistemic uncertainty here represents 
uncertainty about the true values of the noise free function. This uncertainty should grow 
as we move away from the data points, as away from the data there are a greater variety of 
function values consistent with our data. As we observe more and more data, our beliefs 
about the true function become more confident, and the epistemic uncertainty disappears. 
The aleatoric uncertainty in this instance is the observation noise, since the data are given 
to us with this noise, and it cannot be reduced. 


The epistemic uncertainty is captured by the variance of the latent noise free function,
np.diag(post_cov). The aleatoric uncertainty is captured by the noise variance post_sig_est**2.


Unfortunately, people are often careless about how they represent uncertainty, with many
papers showing error bars that are completely undefined, no clear sense of whether we are 
visualizing epistemic or aleatoric uncertainty or both, and confusing noise variances with 
noise standard deviations, standard deviations with standard errors, confidence intervals 
with credible sets, and so on. Without being precise about what the uncertainty represents, 
it is essentially meaningless. 


In the spirit of paying close attention to what our uncertainty represents, it is crucial to
note that we are taking two times the square root of our variance estimate for the noise free 
function. Since our predictive distribution is Gaussian, this quantity enables us to form a 
95% credible set, representing our beliefs about the interval which is 95% likely to contain 
the ground truth function. The noise variance is living on a completely different scale, and 
is much less interpretable. 
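The factor of two comes from the Gaussian distribution itself. As a quick check, added here for illustration, we can compute the probability mass within two standard deviations of the mean and the exact multiplier for a 95% interval using SciPy.

```python
from scipy.stats import norm

# Probability that a Gaussian variable lies within 2 standard deviations.
coverage_2sd = norm.cdf(2.0) - norm.cdf(-2.0)

# Exact multiplier of the standard deviation for a central 95% interval.
z_95 = norm.ppf(0.975)

print(f'coverage of +/- 2 sd: {coverage_2sd:.4f}, 95% multiplier: {z_95:.3f}')
```

Two standard deviations therefore cover slightly more than 95% of the mass, which is why it is a convenient rule of thumb. Applying the same multiplier to a variance rather than a standard deviation would not give an interval with this interpretation.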


Finally, let’s take a look at 20 posterior samples. These samples tell us what types of 
functions we believe might fit our data, a posteriori. 


post_samples = np.random.multivariate_normal(post_mean, post_cov, size=20)
d2l.plt.scatter(train_x, train_y)
d2l.plt.plot(test_x, test_y, linewidth=2.)
d2l.plt.plot(test_x, post_mean, linewidth=2.)
d2l.plt.plot(test_x, post_samples.T, color='gray', alpha=0.25)
d2l.plt.fill_between(test_x, lw_bd, up_bd, alpha=0.25)
plt.legend(['Observed Data', 'True Function', 'Predictive Mean',
            'Posterior Samples'])
d2l.plt.show()




In basic regression applications, it is most common to use the posterior predictive mean 
and standard deviation as a point predictor and metric for uncertainty, respectively. In 
more advanced applications, such as Bayesian optimization with Monte Carlo acquisition 
functions, or Gaussian processes for model-based RL, it is often necessary to take posterior
samples. However, even if not strictly required in the basic applications, these samples 
give us more intuition about the fit we have for the data, and are often useful to include in 
visualizations. 


18.3.5 Making Life Easy with GPyTorch 


As we have seen, it is actually pretty easy to implement basic Gaussian process regression
entirely from scratch. However, as soon as we want to explore a variety of kernel
choices, consider approximate inference (which is needed even for classification), combine
GPs with neural networks, or even have a dataset larger than about 10,000 points, then an
implementation from scratch becomes unwieldy and cumbersome. Some of the most effective
methods for scalable GP inference, such as SKI (also known as KISS-GP), can require
hundreds of lines of code implementing advanced numerical linear algebra routines.


In these cases, the GPyTorch library will make our lives a lot easier. We'll be discussing
GPyTorch more in future notebooks on Gaussian process numerics, and advanced methods.
The GPyTorch library contains many examples. To get a feel for the package, we will
walk through the simple regression example, showing how it can be adapted to reproduce
our above results using GPyTorch. This may seem like a lot of code to simply reproduce the
basic regression above, and in a sense, it is. But we can immediately use a variety of kernels,
scalable inference techniques, and approximate inference, by only changing a few lines of
code from below, instead of writing potentially thousands of lines of new code.


# First let's convert our data into tensors for use with PyTorch
train_x = torch.tensor(train_x)
train_y = torch.tensor(train_y)
test_y = torch.tensor(test_y)


# We are using exact GP inference with a zero mean and RBF kernel
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ZeroMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)


This code block puts the data in the right format for GPyTorch, and specifies that we are
using exact inference, as well as the mean function (zero) and kernel function (RBF) that
we want to use. We can use any other kernel very easily, by using, for instance,
gpytorch.kernels.MaternKernel or gpytorch.kernels.SpectralMixtureKernel. So far, we
have only discussed exact inference, where it is possible to infer a predictive distribution 
without making any approximations. For Gaussian processes, we can only perform exact 
inference when we have a Gaussian likelihood; more specifically, when we assume that our 
observations are generated as a noise-free function represented by a Gaussian process, plus 
Gaussian noise. In future notebooks, we will consider other settings, such as classification, 
where we cannot make these assumptions. 


# Initialize Gaussian likelihood
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)
training_iter = 50
# Find optimal model hyperparameters
model.train()
likelihood.train()
# Use the adam optimizer, includes GaussianLikelihood parameters
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
# Set our loss as the negative log GP marginal likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)


Here, we explicitly specify the likelihood we want to use (Gaussian), the objective we will 
use for training kernel hyperparameters (here, the marginal likelihood), and the procedure 
we want to use for optimizing that objective (in this case, Adam). We note that while
we are using Adam, which is a “stochastic” optimizer, in this case, it is full-batch Adam. 
Because the marginal likelihood does not factorize over data instances, we cannot use an 
optimizer over “mini-batches” of data and be guaranteed convergence. Other optimizers, 
such as L-BFGS, are also supported by GPyTorch. Unlike in standard deep learning, doing 
a good job of optimizing the marginal likelihood corresponds strongly with good general- 
ization, which often inclines us towards powerful optimizers like L-BFGS, assuming they 
are not prohibitively expensive. 
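To see why the marginal likelihood does not factorize over data instances, the sketch below compares the joint log density of a small synthetic dataset under $\mathcal{N}(0, K + \sigma^2 I)$ with the sum of the per-point marginal log densities. It is an illustration under assumptions we chose (10 inputs, an RBF kernel with $a = \ell = 1$, and $\sigma = 0.25$).

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

x = np.linspace(0, 5, 10)
rng = np.random.default_rng(0)
y = np.sin(x) + 0.25 * rng.standard_normal(10)

# Covariance of the observations: RBF kernel (a = ell = 1) plus noise.
Ky = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0) + 0.25 ** 2 * np.eye(10)

# Joint log marginal likelihood, which accounts for correlations between points.
joint = multivariate_normal(mean=np.zeros(10), cov=Ky).logpdf(y)

# Sum of per-point marginal log densities, which ignores those correlations.
factorized = norm(loc=0.0, scale=np.sqrt(np.diag(Ky))).logpdf(y).sum()

print(joint, factorized)
```

The two values differ because the off-diagonal entries of the kernel matrix couple the observations, so a sum of per-example losses over mini-batches is not an unbiased estimate of this objective.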


for i in range(training_iter):
    # Zero gradients from previous iteration
    optimizer.zero_grad()
    # Output from model
    output = model(train_x)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    if i % 10 == 0:
        print(f'Iter {i+1:d}/{training_iter:d} - Loss: {loss.item():.3f} '
              f'squared lengthscale: '
              f'{model.covar_module.base_kernel.lengthscale.item():.3f} '
              f'noise variance: {model.likelihood.noise.item():.3f}')
    optimizer.step()


Iter 1/50 - Loss: 1.000 squared lengthscale: 2.693 noise variance: 0.693 
Iter 11/50 - Loss: 0.711 squared lengthscale: 0.490 noise variance: 0.312 
Iter 21/50 - Loss: 0.451 squared lengthscale: 0.506 noise variance: 0.127 
Iter 31/50 - Loss: 0.330 squared lengthscale: 0.485 noise variance: 0.055 
Iter 41/50 - Loss: 0.344 squared lengthscale: 0.472 noise variance: 0.038 


Here we actually run the optimization procedure, outputting the values of the loss every 10 
iterations. 


# Get into evaluation (predictive posterior) mode
test_x = torch.tensor(test_x)
model.eval()
likelihood.eval()
observed_pred = likelihood(model(test_x))


The above codeblock enables us to make predictions on our test inputs.


with torch.no_grad():
    # Initialize plot
    f, ax = d2l.plt.subplots(1, 1, figsize=(4, 3))
    # Get upper and lower bounds for 95% credible set (in this case, in
    # observation space)
    lower, upper = observed_pred.confidence_region()
    ax.scatter(train_x.numpy(), train_y.numpy())
    ax.plot(test_x.numpy(), test_y.numpy(), linewidth=2.)
    ax.plot(test_x.numpy(), observed_pred.mean.numpy(), linewidth=2.)
    ax.fill_between(test_x.numpy(), lower.numpy(), upper.numpy(), alpha=0.25)
    ax.set_ylim([-1.5, 1.5])
    ax.legend(['True Function', 'Predictive Mean', 'Observed Data',
               '95% Credible Set'])


Finally, we plot the fit.


We see the fits are virtually identical. A few things to note: GPyTorch is working with 
squared length-scales and noise variances. For example, our learned noise standard 
deviation in the from-scratch code is about 0.283. The noise variance found by GPyTorch is 
$0.081 \approx 0.283^2$. In the GPyTorch plot, we also show the credible set in the observation 
space rather than the latent function space, to demonstrate that it indeed covers the 
observed data points. 
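The difference between a credible set in the latent function space and one in the observation space can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a Gaussian predictive distribution at a single test point, a band of two standard deviations around the mean (which is what GPyTorch's `confidence_region` reports), and a hypothetical helper `credible_interval` with made-up values for the predictive mean and latent variance.

```python
import math

def credible_interval(mean, latent_var, noise_var=0.0, num_std=2.0):
    """Return (lower, upper) for a Gaussian predictive distribution.

    With noise_var=0 the band describes the latent function (epistemic
    uncertainty only); adding the noise variance gives the band in
    observation space.
    """
    std = math.sqrt(latent_var + noise_var)
    return mean - num_std * std, mean + num_std * std

noise_std = 0.283            # noise standard deviation quoted in the text
noise_var = noise_std ** 2   # the variance is the squared standard deviation

# Hypothetical predictive mean and latent variance at one test input
latent_band = credible_interval(0.5, latent_var=0.01)
observed_band = credible_interval(0.5, latent_var=0.01, noise_var=noise_var)

print(f"noise variance: {noise_var:.3f}")
print(f"latent band:    {latent_band}")
print(f"observed band:  {observed_band}")
```

The observed band is always at least as wide as the latent band, which is why it is the one we expect to cover the observed data points.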


18.3.6 Summary 


We can combine a Gaussian process prior with data to form a posterior, which we use to 
make predictions. We can also form a marginal likelihood, which is useful for automatic 
learning of kernel hyperparameters, which control properties such as the rate of variation of 
the Gaussian process. The mechanics of forming the posterior and learning kernel hyperpa- 
rameters for regression are simple, involving about a dozen lines of code. This notebook is 
a good reference for any reader wanting to quickly get “up and running” with Gaussian pro- 
cesses. We also introduced the GPyTorch library. Although the GPyTorch code for basic 
regression is relatively long, it can be trivially modified for other kernel functions, or more 
advanced functionality we will discuss in future notebooks, such as scalable inference, or 
non-Gaussian likelihoods for classification. 


18.3.7 Exercises 


1. We have emphasized the importance of learning kernel hyperparameters, and the effect 
of hyperparameters and kernels on the generalization properties of Gaussian processes. 
Try skipping the step where we learn hyperparameters, and instead guess a variety of length-scales 
and noise variances, and check their effect on predictions. What happens when you use 
a large length-scale? A small length-scale? A large noise variance? A small noise 
variance? 


2. We have said that the marginal likelihood is not a convex objective, but that hyperpa- 
rameters like length-scale and noise variance can be reliably estimated in GP regression. 
This is generally true — in fact, the marginal likelihood is much better at learning length- 
scale hyperparameters than conventional approaches in spatial statistics, which involve 
fitting empirical autocorrelation functions (“covariograms”). Arguably, the biggest 
contribution from machine learning to Gaussian process research, at least before recent work 
on scalable inference, was the introduction of the marginal likelihood for hyperparameter 
learning. 


However, different pairings of even these parameters provide interpretably different plau- 
sible explanations for many datasets, leading to local optima in our objective. If we use a 
large length-scale, then we assume the true underlying function is slowly varying. If the 
observed data are varying significantly, then the only way we can plausibly have a large 
length-scale is with a large noise variance. If we use a small length-scale, on the other hand, our fit 
will be very sensitive to the variations in the data, leaving little room to explain variations 
with noise (aleatoric uncertainty). 


Try seeing if you can find these local optima: initialize with very large length-scale with 
large noise, and small length-scales with small noise. Do you converge to different solu- 
tions? 


3. We have said that a fundamental advantage of Bayesian methods is in naturally repre- 
senting epistemic uncertainty. In the above example, we cannot fully see the effects of 
epistemic uncertainty. Try instead to predict with test_x = np.linspace(0, 10, 
1000). What happens to the 95% credible set as your predictions move beyond the 
data? Does it cover the true function in that interval? What happens if you only visual- 
ize aleatoric uncertainty in that region? 


4. Try running the above example, but instead with 10,000, 20,000 and 40,000 training 
points, and measure the runtimes. How does the training time scale? Alternatively, how 
do the runtimes scale with the number of test points? Is it different for the predictive 
mean and the predictive variance? Answer this question both by theoretically working 
out the training and testing time complexities, and by running the code above with a 
different number of points. 


5. Try running the GPyTorch example with different covariance functions, such as the 
Matern kernel. How do the results change? How about the spectral mixture kernel, 
found in the GPyTorch library? Is the marginal likelihood easier to optimize for some 
kernels than for others? Are some more valuable for long-range versus short-range predictions? 


6. In our GPyTorch example, we plotted the predictive distribution including observation 
noise, while in our “from scratch” example, we only included epistemic uncertainty. 
Re-do the GPyTorch example, but this time only plotting epistemic uncertainty, and 
compare to the from-scratch results. Do the predictive distributions now look the same? 
(They should.) 




Hyperparameter Optimization 


Aaron Klein (Amazon), Matthias Seeger (Amazon), and Cedric Archambeau (Amazon) 


The performance of every machine learning model depends on its hyperparameters. They 
control the learning algorithm or the structure of the underlying statistical model. However, 
there is no general way to choose hyperparameters in practice. Instead, hyperparameters 
are often set in a trial-and-error manner or sometimes left to their default values by practi- 
tioners, leading to suboptimal generalization. 


Hyperparameter optimization provides a systematic approach to this problem, by casting 
it as an optimization problem: a good set of hyperparameters should (at least) minimize a 
validation error. Compared to most other optimization problems arising in machine learn- 
ing, hyperparameter optimization is a nested one, where each iteration requires training and 
validating a machine learning model. 


In this chapter, we will first introduce the basics of hyperparameter optimization. We will 
also present some recent advancements that improve the overall efficiency of hyperparame- 
ter optimization by exploiting cheap-to-evaluate proxies of the original objective function. 
At the end of this chapter, you should be able to apply state-of-the-art hyperparameter 
optimization techniques to optimize the hyperparameters of your own machine learning 
algorithm. 


19.1 What Is Hyperparameter Optimization? 


As we have seen in the previous chapters, deep neural networks come with a large number 
of parameters or weights that are learned during training. On top of these, every neural net- 
work has additional hyperparameters that need to be configured by the user. For example, 
to ensure that stochastic gradient descent converges to a local optimum of the training loss 
(see Chapter 12), we have to adjust the learning rate and batch size. To avoid overfitting on 
training datasets, we might have to set regularization parameters, such as weight decay (see 
Section 3.7) or dropout (see Section 5.6). We can define the capacity and inductive bias of 
the model by setting the number of layers and number of units or filters per layer (i.e., the 
effective number of weights). 


Unfortunately, we cannot simply adjust these hyperparameters by minimizing the training 
loss, because this would lead to overfitting on the training data. For example, setting 
regularization parameters, such as dropout or weight decay, to zero leads to a small training 
loss, but might hurt the generalization performance. 
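A tiny example makes this concrete. The sketch below is not part of the chapter's code: it assumes a one-dimensional linear model without intercept, an L2 penalty standing in for weight decay, and a made-up dataset. Because the training error can only grow as the penalty grows, selecting the penalty by training error always returns zero regularization.

```python
def fit_ridge_1d(xs, ys, weight_decay):
    """Closed-form minimizer of sum((y - w*x)^2) + weight_decay * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + weight_decay)

def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Tiny made-up dataset: roughly y = 2x with some noise
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [0.1, 2.3, 3.7, 6.2]

weight_decays = [0.0, 0.1, 1.0, 10.0]
train_errors = [
    mse(fit_ridge_1d(train_x, train_y, wd), train_x, train_y)
    for wd in weight_decays
]
best_by_train = weight_decays[train_errors.index(min(train_errors))]
print(train_errors)
print(f"weight decay chosen by training loss: {best_by_train}")
```

Evaluating the same candidates on held-out data instead is what turns this into a meaningful selection problem.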


[Figure: Typical workflow in machine learning that consists of training the model multiple times 
with different hyperparameters. Diagram labels: set hyperparameters, train, loop until 
validation performance is maximized.] 


Without a different form of automation, hyperparameters have to be set manually in a 
trial-and-error fashion, in what amounts to a time-consuming and difficult part of machine 
learning workflows. For example, consider training a ResNet (see Section 8.6) on CIFAR-10, 
which requires more than 2 hours on an Amazon Elastic Compute Cloud (EC2) g4dn.xlarge 
instance. Even just trying ten hyperparameter configurations in sequence 
would already take us roughly one day. To make matters worse, hyperparameters are usually 
not directly transferable across architectures and datasets (Bardenet et al., 2013, Feurer 
et al., 2022, Wistuba et al., 2018), and need to be re-optimized for every new task. Also, 
for most hyperparameters, there are no rules of thumb, and expert knowledge is required 
to find sensible values. 


Hyperparameter optimization (HPO) algorithms are designed to tackle this problem in a 
principled and automated fashion (Feurer and Hutter, 2018), by framing it as a global op- 
timization problem. The default objective is the error on a hold-out validation dataset, but 
could in principle be any other business metric. It can be combined with or constrained by 
secondary objectives, such as training time, inference time, or model complexity. 


Recently, hyperparameter optimization has been extended to neural architecture search 
(NAS) (Elsken et al., 2018, Wistuba et al., 2019), where the goal is to find entirely new 
neural network architectures. Compared to classical HPO, NAS is even more expensive in 
terms of computation and requires additional efforts to remain feasible in practice. Both 
HPO and NAS can be considered subfields of AutoML (Hutter et al., 2019), which aims 
to automate the entire ML pipeline. 


In this section we will introduce HPO and show how we can automatically find the best 
hyperparameters of the logistic regression example introduced in Section 4.5. 


19.1.1 The Optimization Problem 


We will start with a simple toy problem: searching for the learning rate of the multi-class 
logistic regression model SoftmaxRegression from Section 4.5 to minimize the validation 
error on the Fashion MNIST dataset. While other hyperparameters like batch size or number 
of epochs are also worth tuning, we focus on learning rate alone for simplicity. 


import numpy as np
import torch
from scipy import stats
from torch import nn
from d2l import torch as d2l


Before we can run HPO, we first need to define two ingredients: the objective function and 
the configuration space. 


The Objective Function 


The performance of a learning algorithm can be seen as a function $f : \mathcal{X} \rightarrow \mathbb{R}$ that maps 
from the hyperparameter space $x \in \mathcal{X}$ to the validation loss. For every evaluation of $f(x)$, 
we have to train and validate our machine learning model, which can be time and compute 
intensive in the case of deep neural networks trained on large datasets. Given our criterion 
$f(x)$, our goal is to find $x_\star \in \operatorname{argmin}_{x \in \mathcal{X}} f(x)$. 


There is no simple way to compute gradients of f with respect to x, because this would require 
propagating the gradient through the entire training process. While there is recent work 
(Franceschi et al., 2017, Maclaurin et al., 2015) to drive HPO by approximate “hypergradients”, 
none of the existing approaches are competitive with the state-of-the-art yet, and we 
will not discuss them here. Furthermore, the computational burden of evaluating f requires 
HPO algorithms to approach the global optimum with as few samples as possible. 


The training of neural networks is stochastic (e.g., weights are randomly initialized, 
mini-batches are randomly sampled), so that our observations will be noisy: $y \sim f(x) + \epsilon$, where 
we usually assume that the observation noise $\epsilon \sim N(0, \sigma)$ is Gaussian distributed. 
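The following sketch illustrates what a noisy objective looks like from the point of view of an HPO algorithm. It is purely illustrative: `true_objective` is a hypothetical noise-free validation error, and we assume here that sigma denotes the standard deviation of the Gaussian noise. Repeated evaluations at the same configuration return different values that scatter around the underlying function value.

```python
import random
import statistics

def true_objective(learning_rate):
    """Hypothetical noise-free validation error, minimized at 0.1."""
    return (learning_rate - 0.1) ** 2 + 0.2

def noisy_objective(learning_rate, rng, sigma=0.01):
    """One observation y = f(x) + eps with Gaussian noise eps."""
    return true_objective(learning_rate) + rng.gauss(0.0, sigma)

rng = random.Random(0)
observations = [noisy_objective(0.1, rng) for _ in range(1000)]

print(f"f(x)         = {true_objective(0.1):.4f}")
print(f"mean of y    = {statistics.mean(observations):.4f}")
print(f"std dev of y = {statistics.stdev(observations):.4f}")
```

In practice each observation costs a full training run, so we rarely have the luxury of averaging over many repetitions of the same configuration.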


Faced with all these challenges, we usually try to identify a small set of well performing 
hyperparameter configurations quickly, instead of hitting the global optimum exactly. 
However, due to the large computational demands of most neural network models, even this can 
take days or weeks of compute. We will explore in Section 19.4 how we can speed up the 
optimization process by either distributing the search or using cheaper-to-evaluate 
approximations of the objective function. 


We begin with a method for computing the validation error of a model. 


class HPOTrainer(d2l.Trainer):  #@save
    def validation_error(self):
        self.model.eval()
        accuracy = 0
        val_batch_idx = 0
        for batch in self.val_dataloader:
            with torch.no_grad():
                x, y = self.prepare_batch(batch)
                y_hat = self.model(x)
                accuracy += self.model.accuracy(y_hat, y)
            val_batch_idx += 1
        return 1 - accuracy / val_batch_idx


We optimize validation error with respect to the hyperparameter configuration config, 
consisting of the learning_rate. For each evaluation, we train our model for max_epochs 
epochs, then compute and return its validation error: 


def hpo_objective_softmax_classification(config, max_epochs=8):
    learning_rate = config["learning_rate"]
    trainer = d2l.HPOTrainer(max_epochs=max_epochs)
    data = d2l.FashionMNIST(batch_size=16)
    model = d2l.SoftmaxRegression(num_outputs=10, lr=learning_rate)
    trainer.fit(model=model, data=data)
    return trainer.validation_error().detach().numpy()


The Configuration Space 


Along with the objective function $f(x)$, we also need to define the feasible set $x \in \mathcal{X}$ to 
optimize over, known as the configuration space or search space. For our logistic regression 
example, we will use: 


config_space = {"learning_rate": stats.loguniform(1e-4, 1)}


Here we use the loguniform object from SciPy, which represents a uniform distribution 
between -4 and 0 in the logarithmic (base 10) space, that is, between $10^{-4}$ and 1 on the 
original scale. This object allows us to sample random variables from this distribution. 
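To build some intuition for what sampling on a logarithmic scale means, here is a minimal sketch that uses only the Python standard library rather than SciPy. The helper `sample_loguniform` is our own illustrative function: it draws the logarithm uniformly and exponentiates, so that every decade of the range receives roughly the same share of samples.

```python
import math
import random

def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
samples = [sample_loguniform(1e-4, 1.0, rng) for _ in range(10_000)]

# Each decade between 1e-4 and 1 should receive roughly a quarter of the samples
for exponent in range(-4, 0):
    lo, hi = 10.0 ** exponent, 10.0 ** (exponent + 1)
    share = sum(lo <= s < hi for s in samples) / len(samples)
    print(f"[1e{exponent}, 1e{exponent + 1}): {share:.2f}")
```

A plain uniform distribution on the same interval would instead place about 90% of its samples in the last decade, between 0.1 and 1.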


Each hyperparameter has a data type, such as float for learning_rate, as well as a closed 
bounded range (i.e., lower and upper bounds). We usually assign a prior distribution (e.g., 
uniform or log-uniform) to each hyperparameter to sample from. Some positive parameters, 
such as learning_rate, are best represented on a logarithmic scale as optimal values can 
differ by several orders of magnitude, while others, such as momentum, come with linear 
scale. 


Below we show a simple example of a configuration space consisting of typical hyperpa- 
rameters of a multi-layer perceptron including their type and standard ranges. 


Table 19.1.1: Example configuration space of multi-layer perceptron 

Name                  Type          Hyperparameter Ranges                   log-scale 
learning rate         float         $[10^{-6}, 10^{-1}]$                    yes 
batch size            integer       [8, 256]                                yes 
momentum              float         [0, 0.99]                               no 
activation function   categorical   $\{\textrm{tanh}, \textrm{relu}\}$      - 
number of units       integer       [32, 1024]                              yes 
number of layers      integer       [1, 6]                                  no 
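As a rough illustration of how such a mixed configuration space can be sampled, the following sketch draws one configuration using only the standard library. The function `sample_configuration` and the dictionary keys are our own choices for this example; integers on a log scale are obtained by rounding a log-uniform draw, which is one simple option among several.

```python
import math
import random

def sample_configuration(rng):
    """Draw one configuration from the ranges listed in the table above."""
    def log_uniform(low, high):
        return math.exp(rng.uniform(math.log(low), math.log(high)))

    return {
        "learning_rate": log_uniform(1e-6, 1e-1),        # float, log scale
        "batch_size": int(round(log_uniform(8, 256))),   # integer, log scale
        "momentum": rng.uniform(0.0, 0.99),              # float, linear scale
        "activation": rng.choice(["tanh", "relu"]),      # categorical
        "num_units": int(round(log_uniform(32, 1024))),  # integer, log scale
        "num_layers": rng.randint(1, 6),                 # integer, linear scale
    }

rng = random.Random(0)
config = sample_configuration(rng)
print(config)
```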


In general, the structure of the configuration space $\mathcal{X}$ can be complex and it can be quite 
different from $\mathbb{R}^d$. In practice, some hyperparameters may depend on the value of others. 
For example, assume we try to tune the number of layers for a multi-layer perceptron, and 
for each layer the number of units. The number of units of the $l$-th layer is relevant only if 
the network has at least $l + 1$ layers. These advanced HPO problems are beyond the scope 
of this chapter. We refer the interested reader to (Baptista and Poloczek, 2018, Hutter et 
al., 2011, Jenatton et al., 2017). 
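To give a flavor of such a conditional dependency, the sketch below samples the depth first and then only the widths of layers that actually exist. It is a simplified illustration of the sampling side alone, with a hypothetical function name and ranges; the works cited above deal with the harder question of how a model-based optimizer should treat such spaces.

```python
import random

def sample_mlp_configuration(rng, max_layers=6):
    """Sample the depth first, then one width per layer that actually exists."""
    num_layers = rng.randint(1, max_layers)
    return {
        "num_layers": num_layers,
        # Widths of non-existent layers are simply not part of the configuration
        "num_units": [rng.randint(32, 1024) for _ in range(num_layers)],
    }

rng = random.Random(0)
for _ in range(3):
    print(sample_mlp_configuration(rng))
```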


The configuration space plays an important role in hyperparameter optimization, since 
no algorithm can find a configuration that is not included in the configuration space. On the 
other hand, if the ranges are too large, the computation budget needed to find well performing 
configurations might become infeasible. 


19.1.2 Random Search 


Random search is the first hyperparameter optimization algorithm we will consider. The 
main idea of random search is to independently sample from the configuration space until 
a predefined budget (e.g., a maximum number of iterations) is exhausted, and to return the 
best observed configuration. All evaluations can be executed independently in parallel (see 
Section 19.3), but here we use a sequential loop for simplicity. 


errors, values = [], []
num_iterations = 5

for i in range(num_iterations):
    learning_rate = config_space["learning_rate"].rvs()
    print(f"Trial {i}: learning_rate = {learning_rate}")
    y = hpo_objective_softmax_classification({"learning_rate": learning_rate})
    print(f"    validation_error = {y}")
    values.append(learning_rate)
    errors.append(y)


validation_error = 0.17070001363754272 


The best learning rate is then simply the one with the lowest validation error. 


best_idx = np.argmin(errors)
print(f"optimal learning rate = {values[best_idx]}")


optimal learning rate = 0.09844872561810249 


Due to its simplicity and generality, random search is one of the most frequently used HPO 
algorithms. It does not require any sophisticated implementation and can be applied to 
any configuration space as long as we can define some probability distribution for each 
hyperparameter. 


Unfortunately, random search also comes with a few shortcomings. First, it does not adapt 
the sampling distribution based on the observations it has collected so far. Hence, 
it is just as likely to sample a poorly performing configuration as a better performing 
one. Second, the same amount of resources is spent on all configurations, even 
though some may show poor initial performance and are less likely to outperform previously 
seen configurations. 


In the next sections we will look at more sample efficient hyperparameter optimization 
algorithms that overcome the shortcomings of random search by using a model to guide 
the search. We will also look at algorithms that automatically stop the evaluation process 
of poorly performing configurations to speed up the optimization process. 


19.1.3 Summary 


In this section we introduced hyperparameter optimization (HPO) and how we can phrase 
it as a global optimization problem by defining a configuration space and an objective function. 
We also implemented our first HPO algorithm, random search, and applied it to a simple 
softmax classification problem. 


While random search is very simple, it is often a better alternative to grid search, which simply 
evaluates a fixed grid of hyperparameter configurations. Random search somewhat mitigates the curse 
of dimensionality (Bellman, 1966), and can be far more efficient than grid search if the 
criterion most strongly depends on a small subset of the hyperparameters. 
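The following sketch illustrates this point with a budget of nine evaluations over two hyperparameters scaled to [0, 1]. The setup is made up for illustration: if only the first hyperparameter matters, a 3 x 3 grid explores just three distinct values of it, whereas random search tries a new value in every trial.

```python
import random

budget = 9  # number of configurations we can afford to evaluate

# Grid search: 3 x 3 grid over two hyperparameters scaled to [0, 1]
grid_values = [0.0, 0.5, 1.0]
grid_configs = [(a, b) for a in grid_values for b in grid_values]

# Random search: the same budget, sampled independently
rng = random.Random(0)
random_configs = [(rng.random(), rng.random()) for _ in range(budget)]

# Suppose only the first hyperparameter matters: count distinct values tried
grid_distinct = len({a for a, _ in grid_configs})
random_distinct = len({a for a, _ in random_configs})
print(f"grid search:   {grid_distinct} distinct values of the important hyperparameter")
print(f"random search: {random_distinct} distinct values of the important hyperparameter")
```

The gap widens quickly with more hyperparameters, since a grid with k points per axis in d dimensions needs k to the power d evaluations to try only k values of any single hyperparameter.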


19.1.4 Exercises 


1. In this chapter, we optimize the validation error of a model after training on a disjoint 
training set. For simplicity, our code uses Trainer.val_dataloader, which maps to a 
loader around FashionMNIST.val. 


1. Convince yourself (by looking at the code) that this means we use the original Fash- 
ionMNIST training set (60000 examples) for training, and the original test set (10000 
examples) for validation. 


2. Why could this practice be problematic? Hint: Re-read Section 3.6, especially about 
model selection. 


3. What should we have done instead? 


2. We stated above that hyperparameter optimization by gradient descent is very hard to do. 
Consider a small problem, such as training a two-layer perceptron on the FashionMNIST 
dataset (Section 5.2) with a batch size of 256. We would like to tune the learning rate 
of SGD in order to minimize a validation metric after one epoch of training. 


1. Why can we not use validation error for this purpose? What metric on the validation 
set would you use? 


2. Sketch (roughly) the computational graph of the validation metric after training for 
one epoch. You may assume that initial weights and hyperparameters (such as learn- 
ing rate) are input nodes to this graph. Hint: Re-read about computational graphs in 
Section 5.3. 


3. Give a rough estimate of the number of floating point values you need to store during 
a forward pass on this graph. Hint: FashionMNIST has 60000 cases. Assume the 
required memory is dominated by the activations after each layer, and look up the 
layer widths in Section 5.2. 


4. Apart from the sheer amount of compute and storage required, what other issues 
would gradient-based hyperparameter optimization run into? Hint: Re-read about 
vanishing and exploding gradients in Section 5.4. 


5. Advanced: Read (Maclaurin et al., 2015) for an elegant (yet still somewhat 
impractical) approach to gradient-based HPO. 


3. Grid search is another HPO baseline, where we define an equi-spaced grid for each hy- 
perparameter, then iterate over the (combinatorial) Cartesian product in order to suggest 
configurations. 


1. We stated above that random search can be much more efficient than grid search for 
HPO on a sizable number of hyperparameters, if the criterion most strongly depends 
on a small subset of the hyperparameters. Why is this? Hint: Read (Bergstra et al., 
2011). 


19.2 Hyperparameter Optimization API 


Before we dive into the methodology, we will first discuss a basic code structure that al- 
lows us to efficiently implement various HPO algorithms. In general, all HPO algorithms 
considered here need to implement two decision making primitives, searching and schedul- 
ing. First, they need to sample new hyperparameter configurations, which often involves 
some kind of search over the configuration space. Second, for each configuration, an HPO 
algorithm needs to schedule its evaluation and decide how many resources to allocate for 
it. Once we start to evaluate a configuration, we will refer to it as a trial. We map these 
decisions to two classes, HPOSearcher and HPOScheduler. On top of that, we also provide 
an HPOTuner class that executes the optimization process. 


This concept of scheduler and searcher is also implemented in popular HPO libraries, such 
as Syne Tune (Salinas et al., 2022), Ray Tune (Liaw et al., 2018) or Optuna (Akiba et al., 
2019). 


import time
from scipy import stats
from d2l import torch as d2l


19.2.1 Searcher 


Below we define a base class for searchers, which provides a new candidate configuration 
through the sample_configuration function. A simple way to implement this function 
would be to sample configurations uniformly at random, as we did for random search in 
Section 19.1. More sophisticated algorithms, such as Bayesian optimization, will make 
these decisions based on the performance of previous trials. As a result, these algorithms 
are able to sample more promising candidates over time. We add the update function in 
order to update the history of previous trials, which can then be exploited to improve our 
sampling distribution. 


class HPOSearcher(d2l.HyperParameters):  #@save
    def sample_configuration(self) -> dict:
        raise NotImplementedError

    def update(self, config: dict, error: float, additional_info=None):
        pass


The following code shows how to implement our random search optimizer from the pre- 
vious section in this API. As a slight extension, we allow the user to prescribe the first 
configuration to be evaluated via initial_config, while subsequent ones are drawn at 
random. 


class RandomSearcher(HPOSearcher):  #@save
    def __init__(self, config_space: dict, initial_config=None):
        self.save_hyperparameters()

    def sample_configuration(self) -> dict:
        if self.initial_config is not None:
            result = self.initial_config
            self.initial_config = None
        else:
            result = {
                name: domain.rvs()
                for name, domain in self.config_space.items()
            }
        return result


19.2.2 Scheduler 


Beyond sampling configurations for new trials, we also need to decide when and for how 
long to run a trial. In practice, all these decisions are made by the HPOScheduler, which 
delegates the choice of new configurations to an HPOSearcher. The suggest method is 
called whenever some resource for training becomes available. Apart from invoking 
sample_configuration of a searcher, it may also decide upon parameters like max_epochs 
(i.e., how long to train the model for). The update method is called whenever a trial returns 
a new observation. 


class HPOScheduler(d2l.HyperParameters):  #@save
    def suggest(self) -> dict:
        raise NotImplementedError

    def update(self, config: dict, error: float, info=None):
        raise NotImplementedError


To implement random search, but also other HPO algorithms, we only need a basic sched- 
uler that schedules a new configuration every time new resources become available. 


class BasicScheduler(HPOScheduler):  #@save
    def __init__(self, searcher: HPOSearcher):
        self.save_hyperparameters()

    def suggest(self) -> dict:
        return self.searcher.sample_configuration()

    def update(self, config: dict, error: float, info=None):
        self.searcher.update(config, error, additional_info=info)


19.2.3 Tuner 


Finally, we need a component that runs the scheduler/searcher and does some bookkeeping 
of the results. The following code implements a sequential execution of the HPO trials that 
evaluates one training job after the next and will serve as a basic example. We will later 
use Syne Tune for more scalable distributed HPO cases. 


class HPOTuner(d2l.HyperParameters):  #@save
    def __init__(self, scheduler: HPOScheduler, objective: callable):
        self.save_hyperparameters()
        # Bookkeeping results for plotting
        self.incumbent = None
        self.incumbent_error = None
        self.incumbent_trajectory = []
        self.cumulative_runtime = []
        self.current_runtime = 0
        self.records = []

    def run(self, number_of_trials):
        for i in range(number_of_trials):
            start_time = time.time()
            config = self.scheduler.suggest()
            print(f"Trial {i}: config = {config}")
            error = self.objective(**config)
            error = float(error.cpu().detach().numpy())
            self.scheduler.update(config, error)
            runtime = time.time() - start_time
            self.bookkeeping(config, error, runtime)
            print(f"    error = {error}, runtime = {runtime}")


19.2.4 Bookkeeping the Performance of HPO Algorithms 


With any HPO algorithm, we are mostly interested in the best performing configuration 
(called incumbent) and its validation error after a given wall-clock time. This is why we 
track runtime per iteration, which includes both the time to run an evaluation (call of 
objective) and the time to make a decision (call of scheduler.suggest). In the se- 
quel, we will plot cumulative_runtime against incumbent_trajectory in order to visu- 
alize the any-time performance of the HPO algorithm defined in terms of scheduler (and 
searcher). This allows us to quantify not only how well the configuration found by an 
optimizer works, but also how quickly an optimizer is able to find it. 


@d2l.add_to_class(HPOTuner)  #@save
def bookkeeping(self, config: dict, error: float, runtime: float):
    self.records.append({"config": config, "error": error, "runtime": runtime})
    # Check if the last hyperparameter configuration performs better
    # than the incumbent
    if self.incumbent is None or self.incumbent_error > error:
        self.incumbent = config
        self.incumbent_error = error
    # Add current best observed performance to the optimization trajectory
    self.incumbent_trajectory.append(self.incumbent_error)
    # Update runtime
    self.current_runtime += runtime
    self.cumulative_runtime.append(self.current_runtime)
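To see what this bookkeeping produces, the following standalone sketch replays the same logic on made-up numbers, without any training. The function `anytime_performance` and the error and runtime values are purely illustrative and are not part of the d2l library.

```python
def anytime_performance(errors, runtimes):
    """Return (cumulative_runtime, incumbent_trajectory) for a sequence of trials."""
    cumulative_runtime, incumbent_trajectory = [], []
    elapsed, incumbent_error = 0.0, None
    for error, runtime in zip(errors, runtimes):
        # The incumbent only changes when a trial beats the best error so far
        if incumbent_error is None or error < incumbent_error:
            incumbent_error = error
        elapsed += runtime
        incumbent_trajectory.append(incumbent_error)
        cumulative_runtime.append(elapsed)
    return cumulative_runtime, incumbent_trajectory

# Made-up validation errors and runtimes (in seconds) for five trials
errors = [0.30, 0.45, 0.22, 0.25, 0.18]
runtimes = [60.0, 55.0, 70.0, 65.0, 58.0]
times, trajectory = anytime_performance(errors, runtimes)
print(times)       # [60.0, 115.0, 185.0, 250.0, 308.0]
print(trajectory)  # [0.3, 0.3, 0.22, 0.22, 0.18]
```

The trajectory is non-increasing by construction: trials that are worse than the incumbent still cost time but leave the best observed error unchanged.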
19.2.5 Example: Optimizing the Hyperparameters of a Convolutional Neural Network 


We now use our new implementation of random search to optimize the batch size and 
learning rate of the LeNet convolutional neural network from Section 7.6. We begin by 
defining the objective function, which will once more be validation error. 


def hpo_objective_lenet(learning_rate, batch_size, max_epochs=10):  #@save
    model = d2l.LeNet(lr=learning_rate, num_classes=10)
    trainer = d2l.HPOTrainer(max_epochs=max_epochs, num_gpus=1)
    data = d2l.FashionMNIST(batch_size=batch_size)
    model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
    trainer.fit(model=model, data=data)
    validation_error = trainer.validation_error()
    return validation_error


We also need to define the configuration space. Moreover, the first configuration to be 
evaluated is the default setting used in Section 7.6. 


config_space = {
    "learning_rate": stats.loguniform(1e-2, 1),
    "batch_size": stats.randint(32, 256),
}
initial_config = {
    "learning_rate": 0.1,
    "batch_size": 128,
}

Now we can start our random search: 


searcher = RandomSearcher(config_space, initial_config=initial_config)
scheduler = BasicScheduler(searcher=searcher)
tuner = HPOTuner(scheduler=scheduler, objective=hpo_objective_lenet)
tuner.run(number_of_trials=5)


error = 0.9000097513198853, runtime = 62.85189199447632 


Below we plot the optimization trajectory of the incumbent to get the any-time performance 
of random search: 


board = d2l.ProgressBoard(xlabel="time", ylabel="error")
for time_stamp, error in zip(
    tuner.cumulative_runtime, tuner.incumbent_trajectory
):
    board.draw(time_stamp, error, "random search", every_n=1)


[Figure: any-time performance of random search, plotting error against time.] 


19.2.6 Comparing HPO Algorithms 


Just as with training algorithms or model architectures, it is important to understand how 
to best compare different HPO algorithms. Each HPO run depends on two major sources 
of randomness: the random effects of the training process, such as random weight initial- 
ization or mini-batch ordering, and the intrinsic randomness of the HPO algorithm itself, 
such as the random sampling of random search. Hence, when comparing different algo- 
rithms, it is crucial to run each experiment several times and report statistics, such as mean 
or median, across a population of multiple repetitions of an algorithm based on different 
seeds of the random number generator. 
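The following sketch shows how such a comparison can be aggregated across seeds. It is deliberately simplified: `toy_error` is a hypothetical, noise-free stand-in for a validation error, so only the randomness of the HPO algorithm itself is captured here, not the randomness of training.

```python
import random
import statistics

def toy_error(learning_rate):
    """Hypothetical validation error with its minimum at learning_rate = 0.1."""
    return abs(learning_rate - 0.1) + 0.1

def random_search_trajectory(seed, num_trials=20):
    """Incumbent error after each trial of one random search run."""
    rng = random.Random(seed)
    best, trajectory = float("inf"), []
    for _ in range(num_trials):
        best = min(best, toy_error(rng.uniform(0.0, 1.0)))
        trajectory.append(best)
    return trajectory

# Repeat the experiment with different seeds and aggregate per trial index
runs = [random_search_trajectory(seed) for seed in range(50)]
mean_curve = [statistics.mean(step) for step in zip(*runs)]
std_curve = [statistics.stdev(step) for step in zip(*runs)]

print(f"after  1 trial:  {mean_curve[0]:.3f} +/- {std_curve[0]:.3f}")
print(f"after 20 trials: {mean_curve[-1]:.3f} +/- {std_curve[-1]:.3f}")
```

Plotting `mean_curve` together with a band derived from `std_curve` gives the kind of any-time comparison described in this section; a single run, by contrast, can look much better or worse than the typical behavior.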


To illustrate this, we compare random search (see Section 19.1.2) and Bayesian optimiza- 
tion (Snoek et al., 2012) on tuning the hyperparameters of a feed-forward neural network. 
Each algorithm was evaluated 50 times with a different random seed. The solid line indi- 
cates the average performance of the incumbent across these 50 repetitions and the dashed 
line the standard deviation. We can see that random search and Bayesian optimization 
perform roughly the same up to ~1000 seconds, but Bayesian optimization can make use of 
past observations to identify better configurations and thus quickly outperforms random 
search afterwards. 


[Figure: Example any-time performance plot to compare two algorithms A and B.] 


19.2.7 Summary 


This section laid out a simple, yet flexible interface to implement various HPO algorithms 
that we will look at in this chapter. Similar interfaces can be found in popular open-source 
HPO frameworks. We also looked at how we can compare HPO algorithms, and at the potential
pitfalls one needs to be aware of.


19.2.8 Exercises 


1. The goal of this exercise is to implement the objective function for a slightly more chal- 
lenging HPO problem, and to run more realistic experiments. We will use the two hidden 
layer MLP DropoutMLP implemented in Section 5.6. 


1. Code up the objective function, which should depend on all hyperparameters of the 
model and batch_size. Use max_epochs=50. GPUs do not help here, so num_gpus=0.
Hint: Modify hpo_objective_lenet. 


2. Choose a sensible search space, where num_hiddens_1, num_hiddens_2 are integers 
in [8, 1024], and dropout values lie in [0, 0.95], while batch_size lies in [16, 384]. 
Provide code for config_space, using sensible distributions from scipy.stats. 


3. Run random search on this example with number_of_trials=20 and plot the re- 
sults. Make sure to first evaluate the default configuration of Section 5.6, which 
is initial_config = {'num_hiddens_1': 256, 'num_hiddens_2': 256,
'dropout_1': 0.5, 'dropout_2': 0.5, 'lr': 0.1, 'batch_size': 256}.


2. In this exercise, you will implement a new searcher (subclass of HPOSearcher) which 
makes decisions based on past data. It depends on parameters probab_local, num_init_random. 
Its sample_configuration method works as follows. For the first num_init_random 
calls, do the same as RandomSearcher.sample_configuration. Otherwise, with probability
1 - probab_local, do the same as RandomSearcher.sample_configuration.
Otherwise, pick the configuration which attained the smallest validation error so far, 
select one of its hyperparameters at random, and sample its value randomly like in 
RandomSearcher.sample_configuration, but leave all other values the same. Re- 
turn this configuration, which is identical to the best configuration so far, except in this 
one hyperparameter. 


1. Code up this new LocalSearcher. Hint: Your searcher requires config_space as 
argument at construction. Feel free to use a member of type RandomSearcher. You 
will also have to implement the update method. 


2. Re-run the experiment from the previous exercise, but using your new searcher in- 
stead of RandomSearcher. Experiment with different values for probab_local, 
num_init_random. However, note that a proper comparison between different HPO 
methods requires repeating experiments several times, and ideally considering a 
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19.3 Asynchronous Random Search 


As we have seen in Section 19.2, we might have to wait hours or even days before
random search returns a good hyperparameter configuration, because evaluating
hyperparameter configurations is expensive. In practice, we often have access to a pool of
resources such as multiple GPUs on the same machine or multiple machines with a single
GPU. This raises the question: How do we efficiently distribute random search?


In general, we distinguish between synchronous and asynchronous parallel hyperparameter 
optimization (see Fig. 19.3.1). In the synchronous setting, we wait for all concurrently 
running trials to finish, before we start the next batch. Consider configuration spaces that 
contain hyperparameters such as the number of filters or number of layers of a deep neural 
network. Hyperparameter configurations that contain a larger number of layers or filters
will naturally take more time to finish, and all other trials in the same batch will have to wait 
at synchronisation points (grey area in Fig. 19.3.1) before we can continue the optimization 
process. 


In the asynchronous setting we immediately schedule a new trial as soon as resources be- 
come available. This will optimally exploit our resources, since we can avoid any synchro- 
nisation overhead. For random search, each new hyperparameter configuration is chosen 
independently of all others, and in particular without exploiting observations from any prior 
evaluation. This means we can trivially parallelize random search asynchronously. This is 
not straightforward with more sophisticated methods that make decisions based on previous
observations (see Section 19.5). While we need access to more resources than in the
sequential setting, asynchronous random search exhibits a linear speed-up, in that a certain 
performance is reached K times faster if K trials can be run in parallel. 
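

The difference between the two settings can be illustrated with a small simulation. This is only a sketch: the trial durations are hypothetical, and the two helper functions are our own rather than part of any HPO library.

```python
import heapq

def synchronous_makespan(runtimes, n_workers):
    """Run trials in batches of n_workers; each batch waits for its slowest trial."""
    total = 0.0
    for start in range(0, len(runtimes), n_workers):
        total += max(runtimes[start:start + n_workers])
    return total

def asynchronous_makespan(runtimes, n_workers):
    """Start the next trial as soon as any worker becomes free."""
    free_at = [0.0] * n_workers  # time at which each worker is next idle
    heapq.heapify(free_at)
    for runtime in runtimes:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + runtime)
    return max(free_at)

runtimes = [4.0, 1.0, 1.0, 4.0, 1.0, 1.0]  # hypothetical trial durations
print(sum(runtimes))                       # sequential: 12.0
print(synchronous_makespan(runtimes, 2))   # 4 + 4 + 1 = 9.0
print(asynchronous_makespan(runtimes, 2))  # 6.0
```

In this example the slow trials act as stragglers in the synchronous setting, whereas the asynchronous schedule keeps both workers busy and halves the sequential wall-clock time.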


[Figure: timelines of Trial-0 to Trial-5 under sequential, synchronous, and asynchronous scheduling with two workers; horizontal axis: time.]


Distributing the hyperparameter optimization process either synchronously or
asynchronously. Compared to the sequential setting, we can reduce the overall wall-clock
time while keeping the total compute constant. Synchronous scheduling might lead to idling
workers in the case of stragglers.


In this notebook, we will look at asynchronous random search, where trials are executed
in multiple Python processes on the same machine. Distributed job scheduling and
execution is difficult to implement from scratch. We will use Syne Tune (Salinas et al., 
2022), which provides us with a simple interface for asynchronous HPO. Syne Tune is de- 
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signed to be run with different execution back-ends, and the interested reader is invited to 
study its simple APIs in order to learn more about distributed HPO. 


import logging
from d2l import torch as d2l

logging.basicConfig(level=logging.INFO)
from syne_tune import StoppingCriterion, Tuner
from syne_tune.backend.python_backend import PythonBackend
from syne_tune.config_space import loguniform, randint
from syne_tune.experiments import load_experiment
from syne_tune.optimizer.baselines import RandomSearch


INFO:root:SageMakerBackend is not imported since dependencies are missing. You
can install them with
   pip install 'syne-tune[extra]'
AWS dependencies are not imported since dependencies are missing. You can
install them with
   pip install 'syne-tune[aws]'
or (for everything)
   pip install 'syne-tune[extra]'
AWS dependencies are not imported since dependencies are missing. You can
install them with
   pip install 'syne-tune[aws]'
or (for everything)
   pip install 'syne-tune[extra]'
INFO:root:Ray Tune schedulers and searchers are not imported since
dependencies are missing. You can install them with
   pip install 'syne-tune[raytune]'
or (for everything)
   pip install 'syne-tune[extra]'


19.3.1 Objective Function 


First, we have to define a new objective function such that it now returns the performance 
back to Syne Tune via the report callback. 


def hpo_objective_lenet_synetune(learning_rate, batch_size, max_epochs):
    from syne_tune import Reporter
    from d2l import torch as d2l

    model = d2l.LeNet(lr=learning_rate, num_classes=10)
    trainer = d2l.HPOTrainer(max_epochs=1, num_gpus=1)
    data = d2l.FashionMNIST(batch_size=batch_size)
    model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
    report = Reporter()
    for epoch in range(1, max_epochs + 1):
        if epoch == 1:
            # Initialize the state of Trainer
            trainer.fit(model=model, data=data)
        else:
            trainer.fit_epoch()
        validation_error = trainer.validation_error().cpu().detach().numpy()
        report(epoch=epoch, validation_error=float(validation_error))


Note that the PythonBackend of Syne Tune requires dependencies to be imported inside 
the function definition. 


19.3.2 Asynchronous Scheduler 


First, we define the number of workers that evaluate trials concurrently. We also need to 
specify how long we want to run random search, by defining an upper limit on the total 
wall-clock time. 


n_workers = 2 # Needs to be <= the number of available GPUs 


max_wallclock_time = 12 * 60 # 12 minutes 


Next, we state which metric we want to optimize and whether we want to minimize or 
maximize this metric. Namely, metric needs to correspond to the argument name passed 
to the report callback. 


mode = "min"
metric = "validation_error"


We use the configuration space from our previous example. In Syne Tune, this dictionary 
can also be used to pass constant attributes to the training script. We make use of this 
feature in order to pass max_epochs. Moreover, we specify the first configuration to be 
evaluated in initial_config. 


config_space = {
    "learning_rate": loguniform(1e-2, 1),
    "batch_size": randint(32, 256),
    "max_epochs": 10,
}
initial_config = {
    "learning_rate": 0.1,
    "batch_size": 128,
}
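
To see why a log-uniform distribution suits a hyperparameter such as the learning rate, the following sketch compares it with a plain uniform distribution on the same interval. It uses NumPy directly and is not Syne Tune's own sampling code; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
lower, upper = 1e-2, 1.0

# Log-uniform: sample uniformly in log space, then map back
log_uniform = np.exp(rng.uniform(np.log(lower), np.log(upper), size=10_000))
# Plain uniform on the same interval, for comparison
uniform = rng.uniform(lower, upper, size=10_000)

# Fraction of samples that fall into the lower decade [0.01, 0.1)
print((log_uniform < 0.1).mean())  # close to 0.5
print((uniform < 0.1).mean())      # close to 0.09
```

With log-uniform sampling, each order of magnitude receives roughly the same share of samples, whereas uniform sampling spends most of the budget on large learning rates.
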

Next, we need to specify the back-end for job executions. Here we just consider the distri- 
bution on a local machine where parallel jobs are executed as sub-processes. However, for 
large scale HPO, we could run this also on a cluster or cloud environment, where each trial 
consumes a full instance. 


trial_backend = PythonBackend(
    tune_function=hpo_objective_lenet_synetune,
    config_space=config_space,
)
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We can now create the scheduler for asynchronous random search, which is similar in be- 
haviour to our BasicScheduler from Section 19.2. 


scheduler = RandomSearch(
    config_space,
    metric=metric,
    mode=mode,
    points_to_evaluate=[initial_config],
)


INFO:syne_tune.optimizer.schedulers.fifo:max_resource_level = 10, as inferred from config_space
INFO:syne_tune.optimizer.schedulers.fifo:Master random_seed = 2737092907


Syne Tune also features a Tuner, where the main experiment loop and bookkeeping are
centralized, and interactions between scheduler and back-end are mediated.


stop_criterion = StoppingCriterion(max_wallclock_time=max_wallclock_time) 


tuner = Tuner(
    trial_backend=trial_backend,
    scheduler=scheduler,
    stop_criterion=stop_criterion,
    n_workers=n_workers,
    print_update_interval=int(max_wallclock_time * 0.6),
)


Let us run our distributed HPO experiment. According to our stopping criterion, it will run 
for about 12 minutes. 


tuner.run()


INFO:syne_tune.tuner:results of trials will be saved on /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958
INFO:root:Detected 4 GPUs
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1 --batch_size 128 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/0/checkpoints
INFO:syne_tune.tuner:(trial 0) - scheduled config {'learning_rate': 0.1, 'batch_size': 128, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1702844732454753 --batch_size 114 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/1/checkpoints
INFO:syne_tune.tuner:(trial 1) - scheduled config {'learning_rate': 0.1702844732454753, 'batch_size': 114, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 0 completed.
INFO:syne_tune.tuner:Trial trial_id 1 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.34019846567238493 --batch_size 221 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/2/checkpoints
INFO:syne_tune.tuner:(trial 2) - scheduled config {'learning_rate': 0.34019846567238493, 'batch_size': 221, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.014628124155727769 --batch_size 88 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/3/checkpoints
INFO:syne_tune.tuner:(trial 3) - scheduled config {'learning_rate': 0.014628124155727769, 'batch_size': 88, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 2 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1114831485450576 --batch_size 142 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/4/checkpoints
INFO:syne_tune.tuner:(trial 4) - scheduled config {'learning_rate': 0.1114831485450576, 'batch_size': 142, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 3 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.014076038679980779 --batch_size 223 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/5/checkpoints
INFO:syne_tune.tuner:(trial 5) - scheduled config {'learning_rate': 0.014076038679980779, 'batch_size': 223, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 4 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.02558173674804846 --batch_size 62 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/6/checkpoints
INFO:syne_tune.tuner:(trial 6) - scheduled config {'learning_rate': 0.02558173674804846, 'batch_size': 62, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 5 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.026035979388614055 --batch_size 139 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/7/checkpoints
INFO:syne_tune.tuner:(trial 7) - scheduled config {'learning_rate': 0.026035979388614055, 'batch_size': 139, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 6 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.24202494130424274 --batch_size 231 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/8/checkpoints
INFO:syne_tune.tuner:(trial 8) - scheduled config {'learning_rate': 0.24202494130424274, 'batch_size': 231, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 7 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.10483132064775551 --batch_size 145 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/9/checkpoints
INFO:syne_tune.tuner:(trial 9) - scheduled config {'learning_rate': 0.10483132064775551, 'batch_size': 145, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 8 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.017898854850751864 --batch_size 51 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/10/checkpoints
INFO:syne_tune.tuner:(trial 10) - scheduled config {'learning_rate': 0.017898854850751864, 'batch_size': 51, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 9 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.9645419978270817 --batch_size 200 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/11/checkpoints
INFO:syne_tune.tuner:(trial 11) - scheduled config {'learning_rate': 0.9645419978270817, 'batch_size': 200, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 11 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.10559888854748693 --batch_size 40 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/12/checkpoints
INFO:syne_tune.tuner:(trial 12) - scheduled config {'learning_rate': 0.10559888854748693, 'batch_size': 40, 'max_epochs': 10}
INFO:syne_tune.tuner:tuning status (last metric is reported)
 trial_id     status  iter  learning_rate  batch_size  max_epochs  epoch  validation_error  worker-time
        0  Completed    10       0.100000         128          10   10.0          0.277195    64.928907
        1  Completed    10       0.170284         114          10   10.0          0.286225    65.434195
        2  Completed    10       0.340198         221          10   10.0          0.218990    59.729758
        3  Completed    10       0.014628          88          10   10.0          0.899920    81.001636
        4  Completed    10       0.111483         142          10   10.0          0.268684    64.427400
        5  Completed    10       0.014076         223          10   10.0          0.899922    61.264475
        6  Completed    10       0.025582          62          10   10.0          0.399520    75.966186
        7  Completed    10       0.026036         139          10   10.0          0.899988    62.261541
        8  Completed    10       0.242025         231          10   10.0          0.257636    58.186485
        9  Completed    10       0.104831         145          10   10.0          0.273898    59.771699
       10 InProgress     8       0.017899          51          10    8.0          0.496118    66.999746
       11  Completed    10       0.964542         200          10   10.0          0.181600    59.159662
       12 InProgress     0       0.105599          40          10      -                 -            -
2 trials running, 11 finished (11 until the end), 436.60s wallclock-time


INFO:syne_tune.tuner:Trial trial_id 10 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.5846051207380589 --batch_size 35 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/13/checkpoints
INFO:syne_tune.tuner:(trial 13) - scheduled config {'learning_rate': 0.5846051207380589, 'batch_size': 35, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 12 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.2468891379769198 --batch_size 146 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/14/checkpoints
INFO:syne_tune.tuner:(trial 14) - scheduled config {'learning_rate': 0.2468891379769198, 'batch_size': 146, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 13 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.12956867470224812 --batch_size 218 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/15/checkpoints
INFO:syne_tune.tuner:(trial 15) - scheduled config {'learning_rate': 0.12956867470224812, 'batch_size': 218, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 14 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.24900745354561854 --batch_size 103 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/16/checkpoints
INFO:syne_tune.tuner:(trial 16) - scheduled config {'learning_rate': 0.24900745354561854, 'batch_size': 103, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 15 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.03903577426988046 --batch_size 80 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/17/checkpoints
INFO:syne_tune.tuner:(trial 17) - scheduled config {'learning_rate': 0.03903577426988046, 'batch_size': 80, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 16 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.01846559300690354 --batch_size 183 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/18/checkpoints
INFO:syne_tune.tuner:(trial 18) - scheduled config {'learning_rate': 0.01846559300690354, 'batch_size': 183, 'max_epochs': 10}
INFO:syne_tune.stopping_criterion:reaching max wallclock time (720), stopping there.
INFO:syne_tune.tuner:Stopping trials that may still be running.
INFO:syne_tune.tuner:Tuning finished, results of trials can be found on /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958


Resource summary (last result is reported):
 trial_id     status  iter  learning_rate  batch_size  max_epochs  epoch  validation_error  worker-time
        0  Completed    10       0.100000         128          10     10          0.277195    64.928907
        1  Completed    10       0.170284         114          10     10          0.286225    65.434195
        2  Completed    10       0.340198         221          10     10          0.218990    59.729758
        3  Completed    10       0.014628          88          10     10          0.899920    81.001636
        4  Completed    10       0.111483         142          10     10          0.268684    64.427400
        5  Completed    10       0.014076         223          10     10          0.899922    61.264475
        6  Completed    10       0.025582          62          10     10          0.399520    75.966186
        7  Completed    10       0.026036         139          10     10          0.899988    62.261541
        8  Completed    10       0.242025         231          10     10          0.257636    58.186485
        9  Completed    10       0.104831         145          10     10          0.273898    59.771699
       10  Completed    10       0.017899          51          10     10          0.405545    83.778503
       11  Completed    10       0.964542         200          10     10          0.181600    59.159662
       12  Completed    10       0.105599          40          10     10          0.182500    94.734384
       13  Completed    10       0.584605          35          10     10          0.153846   110.965637
       14  Completed    10       0.246889         146          10     10          0.215050    65.142847
       15  Completed    10       0.129569         218          10     10          0.313873    61.310455
       16  Completed    10       0.249007         103          10     10          0.196101   72.0199127
       17 InProgress     9       0.039036          80          10      9          0.369000    73.403000
       18 InProgress     5       0.018466         183          10      5          0.900263    34.714568
2 trials running, 17 finished (17 until the end), 722.84s wallclock-time

validation_error: best 0.14451533555984497 for trial-id 13


The logs of all evaluated hyperparameter configurations are stored for further analysis. At 
any time during the tuning job, we can easily get the results obtained so far and plot the 
incumbent trajectory. 


d2l.set_figsize()
tuning_experiment = load_experiment(tuner.name)
tuning_experiment.plot()


WARNING:matplotlib.legend:No artists with labels found to put in legend. Note
that artists whose label start with an underscore are ignored when legend()
is called with no argument.


[Figure: best result over time for python-entrypoint-2023-08-18-19-45-39-958.]


19.3.3 Visualize the Asynchronous Optimization Process


Below we visualize how the learning curves of every trial (each color in the plot represents
a trial) evolve during the asynchronous optimization process. At any point in time, there
are as many trials running concurrently as we have workers. Once a trial finishes, we
immediately start the next trial, without waiting for the other trials to finish. Idle time
of workers is reduced to a minimum with asynchronous scheduling.


d2l.set_figsize([6, 2.5])
results = tuning_experiment.results

for trial_id in results.trial_id.unique():
    df = results[results["trial_id"] == trial_id]
    d2l.plt.plot(
        df["st_tuner_time"],
        df["validation_error"],
        marker="o"
    )

d2l.plt.xlabel("wall-clock time")
d2l.plt.ylabel("objective function")


Text(0, 0.5, 'objective function')
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19.3.4 Summary 


We can reduce the waiting time for random search substantially by distributing trials across
parallel resources. In general, we distinguish between synchronous scheduling and
asynchronous scheduling. Synchronous scheduling means that we sample a new batch of
hyperparameter configurations once the previous batch has finished. If we have stragglers
(trials that take more time to finish than other trials), our workers need to wait at
synchronization points. Asynchronous scheduling evaluates a new hyperparameter configuration
as soon as resources become available and hence keeps all workers busy at
any point in time. While random search is easy to distribute asynchronously and does not
require any change to the actual algorithm, other methods require some additional
modifications.


19.3.5 Exercises 


1. Consider the DropoutMLP model implemented in Section 5.6, and used in Exercise 1 of 
Section 19.2. 


1. Implement an objective function hpo_objective_dropoutmlp_synetune to be used 
with Syne Tune. Make sure that your function reports the validation error after every 
epoch. 


2. Using the setup of Exercise 1 in Section 19.2, compare random search to Bayesian 
optimization. If you use SageMaker, feel free to use Syne Tune’s benchmarking 
facilities in order to run experiments in parallel. Hint: Bayesian optimization is 
provided as syne_tune.optimizer.baselines.BayesianOptimization. 


3. For this exercise, you need to run on an instance with at least 4 CPU cores. For one 
of the methods used above (random search, Bayesian optimization), run experiments 
with n_workers=1, n_workers=2, n_workers=4, and compare results (incumbent 
trajectories). At least for random search, you should observe linear scaling with 
respect to the number of workers. Hint: For robust results, you may have to average 
over several repetitions each. 


2. Advanced. The goal of this exercise is to implement a new scheduler in Syne Tune. 


1. Create a virtual environment containing both the d2lbook and syne-tune
sources.


2. Implement the LocalSearcher from Exercise 2 in Section 19.2 as a new searcher in
Syne Tune. Hint: Read this tutorial. Alternatively, you may follow this example.


3. Compare your new LocalSearcher with RandomSearch on the DropoutMLP benchmark.


Discussions.


19.4 Multi-Fidelity Hyperparameter Optimization 


Training neural networks can be expensive even on moderate size datasets. Depending 
on the configuration space (Section 19.1.1), hyperparameter optimization requires tens to 
hundreds of function evaluations to find a well-performing hyperparameter configuration. 
As we have seen in Section 19.3, we can significantly reduce the overall wall-clock time of
HPO by exploiting parallel resources, but this does not reduce the total amount of compute
required.


In this section, we will show how the evaluation of hyperparameter configurations can be 
sped up. Methods such as random search allocate the same amount of resources (e.g., 
number of epochs, training data points) to each hyperparameter evaluation. Fig. 19.4.1 
depicts learning curves of a set of neural networks trained with different hyperparameter 
configurations. After a few epochs we are already able to visually distinguish between well- 
performing and suboptimal configurations. However, the learning curves are noisy, and we 
might still require the full amount of 100 epochs to identify the best performing one. 
Learning curves of random hyperparameter configurations


Multi-fidelity hyperparameter optimization allocates more resources to promising
configurations and stops evaluations of poorly performing ones early. This speeds up the
optimization process, since we can try a larger number of configurations for the same total
amount of resources.


More formally, we expand our definition in Section 19.1.1, such that our objective function
$f(x, r)$ gets an additional input $r \in [r_{\mathrm{min}}, r_{\mathrm{max}}]$, specifying the
amount of resources that we are willing to spend for the evaluation of configuration $x$. We
assume that the error $f(x, r)$ decreases with $r$, whereas the computational cost $c(x, r)$
increases. Typically, $r$ represents the number of epochs for training the neural network,
but it could also be the training subset size or the number of cross-validation folds.


from collections import defaultdict
import numpy as np
from scipy import stats
from d2l import torch as d2l

d2l.set_figsize()


19.4.1 Successive Halving 


One of the simplest ways to adapt random search to the multi-fidelity setting is successive
halving (Jamieson and Talwalkar, 2016, Karnin et al., 2013). The basic idea is to start with
$N$ configurations, for example randomly sampled from the configuration space, and to train
each of them for $r_{\mathrm{min}}$ epochs only. We then discard a fraction of the worst performing trials
and train the remaining ones for longer. Iterating this process, fewer trials run for longer,
until at least one trial reaches $r_{\mathrm{max}}$ epochs.


More formally, consider a minimum budget $r_{\mathrm{min}}$ (for example 1 epoch), a maximum
budget $r_{\mathrm{max}}$, for example max_epochs in our previous example, and a halving constant
$\eta \in \{2, 3, \ldots\}$. For simplicity, assume that $r_{\mathrm{max}} = r_{\mathrm{min}} \eta^K$, with $K$ a positive integer. The number of initial
configurations is then $N = \eta^K$. Let us define the set of rungs $\mathcal{R} = \{r_{\mathrm{min}}, r_{\mathrm{min}}\eta, r_{\mathrm{min}}\eta^2, \ldots, r_{\mathrm{max}}\}$.
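
As a quick check of these definitions, the following sketch computes the rung levels and the number of configurations evaluated at each rung. The helper name rung_schedule and the example values (r_min = 1, eta = 2, K = 3) are illustrative assumptions and are not part of the d2l library.

```python
def rung_schedule(r_min, eta, K):
    """Return (rung_levels, configs_per_rung) for r_max = r_min * eta**K."""
    # Rungs grow geometrically: r_min, r_min * eta, ..., r_min * eta**K
    rung_levels = [r_min * eta ** k for k in range(K + 1)]
    # N = eta**K configurations start at r_min; a 1/eta fraction survives each rung
    configs_per_rung = [eta ** (K - k) for k in range(K + 1)]
    return rung_levels, configs_per_rung


rungs, counts = rung_schedule(r_min=1, eta=2, K=3)
print(rungs)   # [1, 2, 4, 8]
print(counts)  # [8, 4, 2, 1]
```

With this choice, exactly one configuration reaches the final rung.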


One round of successive halving proceeds as follows. We start by running $N$ trials until
the first rung $r_{\mathrm{min}}$. Sorting the validation errors, we keep the top $1/\eta$ fraction (which
amounts to $\eta^{K-1}$ configurations) and discard all the rest. The surviving trials are trained
for the next rung ($r_{\mathrm{min}}\eta$ epochs), and the process is repeated. At each rung, a $1/\eta$ fraction of
trials survives and their training continues with an $\eta$ times larger budget. With this particular
choice of $N$, only a single trial will be trained to the full budget $r_{\mathrm{max}}$. Once such a round
of successive halving is done, we start the next one with a new set of initial configurations,
iterating until the total budget is spent.
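
To make one round concrete, here is a minimal standalone sketch on a synthetic objective. The function toy_error is a hypothetical learning curve invented for illustration (its error decays towards a per-configuration floor), and the sketch counts r epochs for every evaluation, which assumes that trials are retrained from scratch at each rung rather than resumed.

```python
import random


def toy_error(config, epochs):
    # Hypothetical learning curve: the error decays towards config["floor"]
    return config["floor"] + 0.5 / epochs


def successive_halving_round(configs, r_min, eta, r_max, objective):
    """Run one round of successive halving; return (best config, epochs spent)."""
    survivors = list(configs)
    r, spent = r_min, 0
    while True:
        # Evaluate every surviving configuration with budget r
        scored = sorted((objective(c, r), i) for i, c in enumerate(survivors))
        spent += r * len(survivors)
        if r >= r_max:
            return survivors[scored[0][1]], spent
        # Keep the top 1/eta fraction and multiply the budget by eta
        keep = max(1, len(survivors) // eta)
        survivors = [survivors[i] for _, i in scored[:keep]]
        r = min(r * eta, r_max)


rng = random.Random(0)
configs = [{"floor": rng.uniform(0.1, 0.5)} for _ in range(8)]  # N = eta**K = 8
best, spent = successive_halving_round(
    configs, r_min=1, eta=2, r_max=8, objective=toy_error
)
print(best, spent)
print("random search would spend", len(configs) * 8, "epochs")
```

Under these assumptions the round spends 32 epochs in total, whereas training all 8 configurations for 8 epochs would cost 64.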


Fig. 19.4.2 Learning curves of random hyperparameter configurations.


We subclass the HPOScheduler base class from Section 19.2 in order to implement
successive halving, allowing for a generic HPOSearcher object to sample configurations (which,
in our example below, will be a RandomSearcher). Additionally, the user has to pass the
minimum resource $r_{\mathrm{min}}$, the maximum resource $r_{\mathrm{max}}$, and $\eta$ as input. Inside our scheduler,
we maintain a queue of configurations that still need to be evaluated for the current rung $r_i$.
We update the queue every time we jump to the next rung.


class SuccessiveHalvingScheduler(d2l.HPOScheduler):  #@save
    def __init__(self, searcher, eta, r_min, r_max, prefact=1):
        self.save_hyperparameters()
        # Compute K, which is later used to determine the number of
        # configurations
        self.K = int(np.log(r_max / r_min) / np.log(eta))
        # Define the rungs
        self.rung_levels = [r_min * eta ** k for k in range(self.K + 1)]
        if r_max not in self.rung_levels:
            # The final rung should be r_max
            self.rung_levels.append(r_max)
            self.K += 1
        # Bookkeeping
        self.observed_error_at_rungs = defaultdict(list)
        self.all_observed_error_at_rungs = defaultdict(list)
        # Our processing queue
        self.queue = []


In the beginning our queue is empty, and we fill it with $n = \mathrm{prefact} \cdot \eta^{K}$ configurations,
which are first evaluated on the smallest rung $r_{\mathrm{min}}$. Here, prefact allows us to reuse our
code in a different context. For the purpose of this section, we fix prefact = 1. Every time
resources become available and the HPOTuner object queries the suggest function, we
return an element from the queue. Once we finish one round of successive halving, which
means that we evaluated all surviving configurations on the highest resource level $r_{\mathrm{max}}$ and
our queue is empty, we start the entire process again with a new, randomly sampled set of
configurations.


@d2l.add_to_class(SuccessiveHalvingScheduler)  #@save
def suggest(self):
    if len(self.queue) == 0:
        # Start a new round of successive halving
        # Number of configurations for the first rung:
        n0 = int(self.prefact * self.eta ** self.K)
        for _ in range(n0):
            config = self.searcher.sample_configuration()
            config["max_epochs"] = self.r_min  # Set r = r_min
            self.queue.append(config)
    # Return an element from the queue
    return self.queue.pop()


When we collect a new data point, we first update the searcher module. Afterwards, we
check whether we have already collected all data points on the current rung. If so, we sort
all configurations and push the top $1/\eta$ fraction of them into the queue.


@d2l.add_to_class(SuccessiveHalvingScheduler)  #@save
def update(self, config: dict, error: float, info=None):
    ri = int(config["max_epochs"])  # Rung r_i
    # Update our searcher, e.g., if we use Bayesian optimization later
    self.searcher.update(config, error, additional_info=info)
    self.all_observed_error_at_rungs[ri].append((config, error))
    if ri < self.r_max:
        # Bookkeeping
        self.observed_error_at_rungs[ri].append((config, error))
        # Determine how many configurations should be evaluated on this rung
        ki = self.K - self.rung_levels.index(ri)
        ni = int(self.prefact * self.eta ** ki)
        # If we observed all configurations on this rung r_i, we estimate the
        # top 1 / eta configurations, add them to the queue and promote them
        # to the next rung r_{i+1}
        if len(self.observed_error_at_rungs[ri]) >= ni:
            kiplus1 = ki - 1
            niplus1 = int(self.prefact * self.eta ** kiplus1)
            best_performing_configurations = self.get_top_n_configurations(
                rung_level=ri, n=niplus1
            )
            riplus1 = self.rung_levels[self.K - kiplus1]  # r_{i+1}
            # Queue may not be empty: insert new entries at the beginning
            self.queue = [
                dict(config, max_epochs=riplus1)
                for config in best_performing_configurations
            ] + self.queue
            self.observed_error_at_rungs[ri] = []  # Reset


Configurations are sorted based on their observed performance on the current rung. 


@d2l.add_to_class(SuccessiveHalvingScheduler)  #@save
def get_top_n_configurations(self, rung_level, n):
    rung = self.observed_error_at_rungs[rung_level]
    if not rung:
        return []
    sorted_rung = sorted(rung, key=lambda x: x[1])
    return [x[0] for x in sorted_rung[:n]]


Let us see how successive halving is doing on our neural network example. We will use
$r_{\mathrm{min}} = 2$, $\eta = 2$, $r_{\mathrm{max}} = 10$, so that rung levels are 2, 4, 8, 10.


min_number_of_epochs = 2
max_number_of_epochs = 10
eta = 2
num_gpus = 1

config_space = {
    "learning_rate": stats.loguniform(1e-2, 1),
    "batch_size": stats.randint(32, 256),
}
initial_config = {
    "learning_rate": 0.1,
    "batch_size": 128,
}

We just replace the scheduler with our new SuccessiveHalvingScheduler. 


searcher = d2l.RandomSearcher(config_space, initial_config=initial_config)
scheduler = SuccessiveHalvingScheduler(
    searcher=searcher,
    eta=eta,
    r_min=min_number_of_epochs,
    r_max=max_number_of_epochs,
)
tuner = d2l.HPOTuner(
    scheduler=scheduler,
    objective=d2l.hpo_objective_lenet,
)
tuner.run(number_of_trials=30)


error = 0.17762434482574463, runtime = 53.576584339141846


We can visualize the learning curves of all configurations that we evaluated. Most of the
configurations are stopped early and only the better performing configurations survive until
$r_{\mathrm{max}}$. Compare this to vanilla random search, which would allocate $r_{\mathrm{max}}$ to every
configuration.
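
The saving can be estimated with a little arithmetic. The sketch below counts epochs for one round with the rung levels used above, under the assumption that every evaluation at rung r trains for r epochs from scratch (no checkpoint reuse); the variable names are illustrative only.

```python
rung_levels = [2, 4, 8, 10]  # r_min = 2, eta = 2, r_max = 10
eta = 2
K = len(rung_levels) - 1
# Number of configurations evaluated at each rung: eta**K, ..., eta, 1
configs_per_rung = [eta ** (K - k) for k in range(K + 1)]

trials_per_round = sum(configs_per_rung)
sh_epochs = sum(n * r for n, r in zip(configs_per_rung, rung_levels))
# Random search trains each of the initial configurations to r_max
rs_epochs = configs_per_rung[0] * rung_levels[-1]

print(configs_per_rung)   # [8, 4, 2, 1]
print(trials_per_round)   # 15
print(sh_epochs, rs_epochs)  # 58 80
```

One round thus consists of 15 evaluations, so a budget of 30 trials corresponds to two complete rounds when trials are processed one round at a time.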


[Training plots for the evaluated configurations: train_loss, val_loss, and val_acc versus epoch.]


for rung_index, rung in scheduler.all_observed_error_at_rungs.items():
    errors = [xi[1] for xi in rung]
    d2l.plt.scatter([rung_index] * len(errors), errors)
d2l.plt.xlim(min_number_of_epochs - 0.5, max_number_of_epochs + 0.5)
d2l.plt.xticks(
    np.arange(min_number_of_epochs, max_number_of_epochs + 1),
    np.arange(min_number_of_epochs, max_number_of_epochs + 1)
)
d2l.plt.ylabel("validation error")
d2l.plt.xlabel("epochs")


Text(0.5, 0, 'epochs')


Finally, note some slight complexity in our implementation of SuccessiveHalvingScheduler.
Say that a worker is free to run a job, and suggest is called when the current rung
has almost been completely filled, but another worker is still busy with an evaluation. Since
we lack the metric value from this worker, we cannot determine the top $1/\eta$ fraction to open
up the next rung. On the other hand, we want to assign a job to our free worker, so it does
not remain idle. Our solution is to start a new round of successive halving and assign our
worker to the first trial there. However, once a rung is completed in update, we make sure
to insert new configurations at the beginning of the queue, so they take precedence over
configurations from the next round.


19.4.2 Summary 


In this section, we introduced the concept of multi-fidelity hyperparameter optimization,
where we assume that we have access to cheap-to-evaluate approximations of the objective
function, such as the validation error after a certain number of epochs of training as a proxy
for the validation error after the full number of epochs. Multi-fidelity hyperparameter
optimization allows us to reduce the overall computation of HPO instead of just reducing the
wall-clock time.


We implemented and evaluated successive halving, a simple yet efficient multi-fidelity HPO 
algorithm. 


Discussions.


19.5 Asynchronous Successive Halving


As we have seen in Section 19.3, we can accelerate HPO by distributing the evaluation of
hyperparameter configurations across either multiple instances or multiple CPUs / GPUs
on a single instance. However, compared to random search, it is not straightforward to
run successive halving (SH) asynchronously in a distributed setting. Before we can decide
which configuration to run next, we first have to collect all observations at the current rung
level. This requires synchronizing workers at each rung level. For example, for the lowest
rung level $r_{\mathrm{min}}$, we first have to evaluate all $N = \eta^K$ configurations before we can promote
the top $1/\eta$ of them to the next rung level.


In any distributed system, synchronization typically implies idle time for workers. First,
we often observe high variations in training time across hyperparameter configurations.
For example, assuming the number of filters per layer is a hyperparameter, networks
with fewer filters finish training faster than networks with more filters, which implies idle
worker time due to stragglers. Moreover, the number of slots in a rung level is not always a
multiple of the number of workers, in which case some workers may even sit idle for a full
batch.
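
The effect of stragglers can be illustrated with a small simulation. This is only a sketch: the function rung_idle_time and the runtimes below are hypothetical, and it assumes that each trial is assigned greedily to the worker that becomes free first and that all workers wait at a barrier until the rung is complete.

```python
import heapq


def rung_idle_time(runtimes, n_workers):
    """Schedule one rung greedily; return (makespan, total idle time at the barrier)."""
    free_at = [0.0] * n_workers  # Min-heap of times at which workers become free
    heapq.heapify(free_at)
    for t in runtimes:
        start = heapq.heappop(free_at)  # Earliest available worker
        heapq.heappush(free_at, start + t)
    makespan = max(free_at)
    # Every worker that finishes before the slowest one waits for it
    idle = sum(makespan - f for f in free_at)
    return makespan, idle


# Four trials on two workers; the third trial is a straggler
makespan, idle = rung_idle_time([1.0, 1.0, 3.0, 1.0], n_workers=2)
print(makespan, idle)  # 4.0 2.0
```

Here one worker waits two time units for the straggler before the next rung can start.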


Fig. 19.5.1 shows the scheduling of synchronous SH with $\eta = 2$ for four different
trials with two workers. We start by evaluating Trial-0 and Trial-1 for one epoch and
immediately continue with the next two trials once they are finished. We first have to wait
until Trial-2 finishes, which takes substantially more time than the other trials, before we
can promote the best two trials, i.e., Trial-0 and Trial-3, to the next rung level. This causes
idle time for Worker-1. Then, we continue with Rung 1. Here too, Trial-3 takes longer
than Trial-0, which leads to additional idle time for Worker-0. Once we reach Rung 2,
only the best trial, Trial-0, remains, which occupies only one worker. To avoid Worker-1
idling during that time, most implementations of SH already continue with the next round
and start evaluating new trials (e.g., Trial-4) on the first rung.


Fig. 19.5.1 Synchronous successive halving with two workers.


Asynchronous successive halving (ASHA) (Li et al., 2018) adapts SH to the asynchronous
parallel scenario. The main idea of ASHA is to promote configurations to the next rung
level as soon as we have collected at least $\eta$ observations on the current rung level. This decision
rule may lead to suboptimal promotions: configurations can be promoted to the next rung
level, which in hindsight do not compare favourably against most others at the same rung
level. On the other hand, we get rid of all synchronization points this way. In practice, such
suboptimal initial promotions have only a modest impact on performance, not only because
the ranking of hyperparameter configurations is often fairly consistent across rung levels,
but also because rungs grow over time and reflect the distribution of metric values at this
level better and better. If a worker is free, but no configuration can be promoted, we start a
new configuration with $r = r_{\mathrm{min}}$, i.e., the first rung level.


Fig. 19.5.2 shows the scheduling of the same configurations for ASHA. Once Trial-1
finishes, we collect the results of two trials (i.e., Trial-0 and Trial-1) and immediately promote
the better of them (Trial-0) to the next rung level. After Trial-0 finishes on rung 1, there
are too few trials there to support a further promotion. Hence, we continue with
rung 0 and evaluate Trial-3. Once Trial-3 finishes, Trial-2 is still pending. At this point we
have 3 trials evaluated on rung 0 and one trial evaluated already on rung 1. Since Trial-3
performs worse than Trial-0 at rung 0, and $\eta = 2$, we cannot promote any new trial yet, and
Worker-1 starts Trial-4 from scratch instead. However, once Trial-2 finishes and scores
worse than Trial-3, the latter is promoted towards rung 1. Afterwards, we have collected 2
evaluations on rung 1, which means we can now promote Trial-0 towards rung 2. At the same
time, Worker-1 continues with evaluating new trials (i.e., Trial-5) on rung 0.
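
The promotion decisions in this walk-through can be replayed with a few lines of code. This is a minimal sketch of the promotion rule only, not the Syne Tune implementation: the function promotable is a hypothetical helper, and the error values are made up so that their ordering matches the description above.

```python
def promotable(rung, promoted, eta):
    """Return the best not-yet-promoted trial in the top 1/eta of `rung`, or None.

    `rung` maps a trial name to the validation error observed at this rung level.
    """
    ranked = sorted(rung, key=rung.get)
    top = ranked[: len(ranked) // eta]
    for trial in top:
        if trial not in promoted:
            return trial
    return None


eta = 2
rung0, promoted0 = {}, set()
decisions = []
# Hypothetical errors in order of completion on rung 0
for trial, error in [("Trial-0", 0.20), ("Trial-1", 0.40),
                     ("Trial-3", 0.30), ("Trial-2", 0.50)]:
    rung0[trial] = error
    choice = promotable(rung0, promoted0, eta)
    if choice is not None:
        promoted0.add(choice)
    decisions.append(choice)
print(decisions)  # [None, 'Trial-0', None, 'Trial-3']

# Once both promoted trials have been evaluated on rung 1, the better one moves on
rung1 = {"Trial-0": 0.15, "Trial-3": 0.25}
print(promotable(rung1, set(), eta))  # Trial-0
```

Whenever the rule returns None, a free worker would start a new configuration on rung 0 instead.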


Fig. 19.5.2 Asynchronous successive halving (ASHA) with two workers.


import logging
from d2l import torch as d2l

logging.basicConfig(level=logging.INFO)
import matplotlib.pyplot as plt
from syne_tune import StoppingCriterion, Tuner
from syne_tune.backend.python_backend import PythonBackend
from syne_tune.config_space import loguniform, randint
from syne_tune.experiments import load_experiment
from syne_tune.optimizer.baselines import ASHA


INFO:root:SageMakerBackend is not imported since dependencies are missing. You can install them with
   pip install 'syne-tune[extra]'
AWS dependencies are not imported since dependencies are missing. You can install them with
   pip install 'syne-tune[aws]'
or (for everything)
   pip install 'syne-tune[extra]'
AWS dependencies are not imported since dependencies are missing. You can install them with
   pip install 'syne-tune[aws]'
or (for everything)
   pip install 'syne-tune[extra]'
INFO:root:Ray Tune schedulers and searchers are not imported since dependencies are missing. You can install them with
   pip install 'syne-tune[raytune]'
or (for everything)
   pip install 'syne-tune[extra]'


19.5.1 Objective Function 


We will use Syne Tune with the same objective function as in Section 19.3. 


def hpo_objective_lenet_synetune(learning_rate, batch_size, max_epochs):
    from syne_tune import Reporter
    from d2l import torch as d2l

    model = d2l.LeNet(lr=learning_rate, num_classes=10)
    trainer = d2l.HPOTrainer(max_epochs=1, num_gpus=1)
    data = d2l.FashionMNIST(batch_size=batch_size)
    model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
    report = Reporter()
    for epoch in range(1, max_epochs + 1):
        if epoch == 1:
            # Initialize the state of Trainer
            trainer.fit(model=model, data=data)
        else:
            trainer.fit_epoch()
        validation_error = trainer.validation_error().cpu().detach().numpy()
        report(epoch=epoch, validation_error=float(validation_error))


We will also use the same configuration space as before: 


min_number_of_epochs = 2
max_number_of_epochs = 10
eta = 2

config_space = {
    "learning_rate": loguniform(1e-2, 1),
    "batch_size": randint(32, 256),
    "max_epochs": max_number_of_epochs,
}
initial_config = {
    "learning_rate": 0.1,
    "batch_size": 128,
}


19.5.2 Asynchronous Scheduler


First, we define the number of workers that evaluate trials concurrently. We also need to
specify how long we want to run the search, by defining an upper limit on the total
wall-clock time.


n_workers = 2 # Needs to be <= the number of available GPUs 
max_wallclock_time = 12 * 60 # 12 minutes 


The code for running ASHA is a simple variation of what we did for asynchronous random 
search. 


mode = "min"
metric = "validation_error"
resource_attr = "epoch"

scheduler = ASHA(
    config_space,
    metric=metric,
    mode=mode,
    points_to_evaluate=[initial_config],
    max_resource_attr="max_epochs",
    resource_attr=resource_attr,
    grace_period=min_number_of_epochs,
    reduction_factor=eta,
)


INFO:syne_tune.optimizer.schedulers.fifo:max_resource_level = 10, as inferred from config_space
INFO:syne_tune.optimizer.schedulers.fifo:Master random_seed = 3140976097


Here, metric and resource_attr specify the key names used with the report callback,
and max_resource_attr denotes which input to the objective function corresponds to $r_{\mathrm{max}}$.
Moreover, grace_period provides $r_{\mathrm{min}}$, and reduction_factor is $\eta$. We can run Syne
Tune as before (this will take about 12 minutes):


trial_backend = PythonBackend(
    tune_function=hpo_objective_lenet_synetune,
    config_space=config_space,
)

stop_criterion = StoppingCriterion(max_wallclock_time=max_wallclock_time)
tuner = Tuner(
    trial_backend=trial_backend,
    scheduler=scheduler,
    stop_criterion=stop_criterion,
    n_workers=n_workers,
    print_update_interval=int(max_wallclock_time * 0.6),
)


tuner.run()


INFO:syne_tune.tuner:results of trials will be saved on /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046
INFO:root:Detected 4 GPUs
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1 --batch_size 128 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/0/checkpoints
INFO:syne_tune.tuner:(trial 0) - scheduled config {'learning_rate': 0.1, 'batch_size': 128, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.44639554136672527 --batch_size 196 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/1/checkpoints
INFO:syne_tune.tuner:(trial 1) - scheduled config {'learning_rate': 0.44639554136672527, 'batch_size': 196, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.011548051321691994 --batch_size 254 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/2/checkpoints
INFO:syne_tune.tuner:(trial 2) - scheduled config {'learning_rate': 0.011548051321691994, 'batch_size': 254, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.14942487313193167 --batch_size 132 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/3/checkpoints
INFO:syne_tune.tuner:(trial 3) - scheduled config {'learning_rate': 0.14942487313193167, 'batch_size': 132, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 1 completed.

INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.06317157191455719 --batch_size 242 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/4/checkpoints
INFO:syne_tune.tuner:(trial 4) - scheduled config {'learning_rate': 0.06317157191455719, 'batch_size': 242, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.48801815412811467 --batch_size 41 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/5/checkpoints
INFO:syne_tune.tuner:(trial 5) - scheduled config {'learning_rate': 0.48801815412811467, 'batch_size': 41, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.5904067586747807 --batch_size 244 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/6/checkpoints
INFO:syne_tune.tuner:(trial 6) - scheduled config {'learning_rate': 0.5904067586747807, 'batch_size': 244, 'max_epochs': 10}

INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.08812857364095393 --batch_size 148 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/7/checkpoints
INFO:syne_tune.tuner:(trial 7) - scheduled config {'learning_rate': 0.08812857364095393, 'batch_size': 148, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.012271314788363914 --batch_size 235 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/8/checkpoints
INFO:syne_tune.tuner:(trial 8) - scheduled config {'learning_rate': 0.012271314788363914, 'batch_size': 235, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 5 completed.

INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.08845692598296777 --batch_size 236 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/9/checkpoints
INFO:syne_tune.tuner:(trial 9) - scheduled config {'learning_rate': 0.08845692598296777, 'batch_size': 236, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.0825770880068151 --batch_size 75 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/10/checkpoints
INFO:syne_tune.tuner:(trial 10) - scheduled config {'learning_rate': 0.0825770880068151, 'batch_size': 75, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.20235201406823256 --batch_size 65 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/11/checkpoints
INFO:syne_tune.tuner:(trial 11) - scheduled config {'learning_rate': 0.20235201406823256, 'batch_size': 65, 'max_epochs': 10}

INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.3359885631737537 --batch_size 58 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/12/checkpoints
INFO:syne_tune.tuner:(trial 12) - scheduled config {'learning_rate': 0.3359885631737537, 'batch_size': 58, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.7892434579795236 --batch_size 89 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/13/checkpoints
INFO:syne_tune.tuner:(trial 13) - scheduled config {'learning_rate': 0.7892434579795236, 'batch_size': 89, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1233786579597858 --batch_size 176 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/14/checkpoints
INFO:syne_tune.tuner:(trial 14) - scheduled config {'learning_rate': 0.1233786579597858, 'batch_size': 176, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 13 completed.

INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.13707981127012328 --batch_size 141 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/15/checkpoints
INFO:syne_tune.tuner:(trial 15) - scheduled config {'learning_rate': 0.13707981127012328, 'batch_size': 141, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.02913976299993913 --batch_size 116 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/16/checkpoints
INFO:syne_tune.tuner:(trial 16) - scheduled config {'learning_rate': 0.02913976299993913, 'batch_size': 116, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.033362897489792855 --batch_size 154 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/17/checkpoints
INFO:syne_tune.tuner:(trial 17) - scheduled config {'learning_rate': 0.033362897489792855, 'batch_size': 154, 'max_epochs': 10}

INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.29442952580755816 --batch_size 210 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/18/checkpoints
INFO:syne_tune.tuner:(trial 18) - scheduled config {'learning_rate': 0.29442952580755816, 'batch_size': 210, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.10214259921521483 --batch_size 239 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/19/checkpoints
INFO:syne_tune.tuner:(trial 19) - scheduled config {'learning_rate': 0.10214259921521483, 'batch_size': 239, 'max_epochs': 10}
INFO:syne_tune.tuner:tuning status (last metric is reported)


 trial_id      status  iter  learning_rate  batch_size  max_epochs  epoch  validation_error  worker-time
        0     Stopped     4       0.100000         128          10    4.0          0.430578    29.093798
        1   Completed    10       0.446396         196          10   10.0          0.205652    72.747496
        2     Stopped     2       0.011548         254          10    2.0          0.900570    13.729115
        3     Stopped     8       0.149425         132          10    8.0          0.259171    58.980305
        4     Stopped     4       0.063172         242          10    4.0          0.900579    27.773950
        5   Completed    10       0.488018          41          10   10.0          0.140488   113.171314
        6     Stopped    10       0.590407         244          10   10.0          0.193776    70.364757
        7     Stopped     2       0.088129         148          10    2.0          0.899955    14.169738
        8     Stopped     2       0.012271         235          10    2.0          0.899840    13.434274
        9     Stopped     2       0.088457         236          10    2.0          0.899801    13.034437
       10     Stopped     4       0.082577          75          10    4.0          0.385970    35.426524
       11     Stopped     4       0.202352          65          10    4.0          0.543102    34.653495
       12     Stopped    10       0.335989          58          10   10.0          0.149558    90.924182
       13   Completed    10       0.789243          89          10   10.0          0.144887    77.365970
       14     Stopped     2       0.123379         176          10    2.0          0.899987    12.422906
       15     Stopped     2       0.137080         141          10    2.0          0.899983    13.395153
       16     Stopped     4       0.029140         116          10    4.0          0.900532    27.834111
       17     Stopped     2       0.033363         154          10    2.0          0.899996    13.407285
       18  InProgress     1       0.294430         210          10    1.0          0.899878     6.126259
       19  InProgress     0       0.102143         239          10      -                 -            -
2 trials running, 18 finished (3 until the end), 437.07s wallclock-time


INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/
↪python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.
↪py --learning_rate 0.02846298236356246 --batch_size 115 --max_epochs 10 --
↪tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-
↪046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_
↪checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/
↪20/checkpoints

INFO:syne_tune.tuner:(trial 20) - scheduled config {'learning_rate': 0.
↪02846298236356246, 'batch_size': 115, 'max_epochs': 10}

INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/
↪python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.
↪py --learning_rate 0.037703019195187606 --batch_size 91 --max_epochs 10 --
↪tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-
↪046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_
↪checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/
↪21/checkpoints

INFO:syne_tune.tuner:(trial 21) - scheduled config {'learning_rate': 0.
↪037703019195187606, 'batch_size': 91, 'max_epochs': 10}

INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/
↪python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.
↪py --learning_rate 0.0741039859356903 --batch_size 192 --max_epochs 10 --
↪tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-
↪046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_
↪checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/
↪22/checkpoints

INFO:syne_tune.tuner:(trial 22) - scheduled config {'learning_rate': 0.
↪0741039859356903, 'batch_size': 192, 'max_epochs': 10}

INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/
↪python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.
↪py --learning_rate 0.3032613031191755 --batch_size 252 --max_epochs 10 --
↪tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-
↪046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_
↪checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/
↪23/checkpoints

INFO:syne_tune.tuner:(trial 23) - scheduled config {'learning_rate': 0.
↪3032613031191755, 'batch_size': 252, 'max_epochs': 10}

INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/
↪python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.
↪py --learning_rate 0.019823425532533637 --batch_size 252 --max_epochs 10 --
↪tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-
↪046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_
↪checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/
↪24/checkpoints

INFO:syne_tune.tuner:(trial 24) - scheduled config {'learning_rate': 0.
↪019823425532533637, 'batch_size': 252, 'max_epochs': 10}

INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/
↪python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.
↪py --learning_rate 0.8203370335228594 --batch_size 77 --max_epochs 10 --tune_
↪function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/
↪tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_
↪checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/
↪25/checkpoints

INFO:syne_tune.tuner:(trial 25) - scheduled config {'learning_rate': 0.
↪8203370335228594, 'batch_size': 77, 'max_epochs': 10}

INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/
↪python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.
↪py --learning_rate 0.2960420911378594 --batch_size 104 --max_epochs 10 --
↪tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-
↪046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_
↪checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/
↪26/checkpoints

INFO:syne_tune.tuner:(trial 26) - scheduled config {'learning_rate': 0.
↪2960420911378594, 'batch_size': 104, 'max_epochs': 10}

INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/
↪python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.
↪py --learning_rate 0.2993874715754653 --batch_size 192 --max_epochs 10 --
↪tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-
↪046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_
↪checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/
↪27/checkpoints

INFO:syne_tune.tuner:(trial 27) - scheduled config {'learning_rate': 0.
↪2993874715754653, 'batch_size': 192, 'max_epochs': 10}

INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/
↪python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.
↪py --learning_rate 0.08056711961080017 --batch_size 36 --max_epochs 10 --
↪tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-
↪046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_
↪checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/
↪28/checkpoints

INFO:syne_tune.tuner:(trial 28) - scheduled config {'learning_rate': 0.
↪08056711961080017, 'batch_size': 36, 'max_epochs': 10}

INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/
↪python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.
↪py --learning_rate 0.26868380288030347 --batch_size 151 --max_epochs 10 --
↪tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-
↪046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_
↪checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/
↪29/checkpoints

INFO:syne_tune.tuner:(trial 29) - scheduled config {'learning_rate': 0.
↪26868380288030347, 'batch_size': 151, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 29 completed.

INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/
↪python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.
↪py --learning_rate 0.9197404791177789 --batch_size 66 --max_epochs 10 --tune_
↪function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/
↪tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_
↪checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/
↪30/checkpoints

INFO:syne_tune.tuner:(trial 30) - scheduled config {'learning_rate': 0.
↪9197404791177789, 'batch_size': 66, 'max_epochs': 10}

INFO:syne_tune.stopping_criterion:reaching max wallclock time (720), stopping
↪there.
INFO:syne_tune.tuner:Stopping trials that may still be running.
INFO:syne_tune.tuner:Tuning finished, results of trials can be found on /home/
↪ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046


Resource summary (last result is reported):
 trial_id      status  iter  learning_rate  batch_size  max_epochs  epoch  validation_error  worker-time
        0     Stopped     4       0.100000         128          10      4          0.430578    29.093798
        1   Completed    10       0.446396         196          10     10          0.205652    72.747496
        2     Stopped     2       0.011548         254          10      2          0.900570    13.729115
        3     Stopped     8       0.149425         132          10      8          0.259171    58.980305
        4     Stopped     4       0.063172         242          10      4          0.900579    27.773950
        5   Completed    10       0.488018          41          10     10          0.140488   113.171314
        6     Stopped    10       0.590407         244          10     10          0.193776    70.364757
        7     Stopped     2       0.088129         148          10      2          0.899955    14.169738
        8     Stopped     2       0.012271         235          10      2          0.899840    13.434274
        9     Stopped     2       0.088457         236          10      2          0.899801    13.034437
       10     Stopped     4       0.082577          75          10      4          0.385970    35.426524
       11     Stopped     4       0.202352          65          10      4          0.543102    34.653495
       12     Stopped    10       0.335989          58          10     10          0.149558    90.924182
       13   Completed    10       0.789243          89          10     10          0.144887    77.365970
       14     Stopped     2       0.123379         176          10      2          0.899987    12.422906
       15     Stopped     2       0.137080         141          10      2          0.899983    13.395153
       16     Stopped     4       0.029140         116          10      4          0.900532    27.834111
       17     Stopped     2       0.033363         154          10      2          0.899996    13.407285
       18     Stopped     8       0.294430         210          10      8          0.241193    52.089688
       19     Stopped     2       0.102143         239          10      2          0.900002    12.487762
       20     Stopped     2       0.028463         115          10      2          0.899995    14.100359
       21     Stopped     2       0.037703          91          10      2          0.900026    14.664848
       22     Stopped     2       0.074104         192          10      2          0.901730    13.312770
       23     Stopped     2       0.303261         252          10      2          0.900009    12.725821
       24     Stopped     2       0.019823         252          10      2          0.899917    12.533380
       25     Stopped    10       0.820337          77          10     10          0.196842    81.816103
       26     Stopped    10       0.296042         104          10     10          0.198453    81.121330
       27     Stopped     4       0.299387         192          10      4          0.336183    24.610689
       28  InProgress     9       0.080567          36          10      9          0.203052   104.303746
       29   Completed    10       0.268684         151          10     10          0.222814    68.217289
       30  InProgress     1       0.919740          66          10      1          0.900037    10.070776
2 trials running, 29 finished (4 until the end), 723.70s wallclock-time

validation_error: best 0.1404876708984375 for trial-id 5


Note that we are running a variant of ASHA where underperforming trials are stopped 
early. This is different to our implementation in Section 19.4.1, where each training job is 
started with a fixed max_epochs. In the latter case, a well-performing trial which reaches 
the full 10 epochs, first needs to train 1, then 2, then 4, then 8 epochs, each time starting 
from scratch. This type of pause-and-resume scheduling can be implemented efficiently by 
checkpointing the training state after each epoch, but we avoid this extra complexity here. 
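
To make the cost difference concrete, the following minimal sketch counts the epochs that a single well-performing trial consumes under the two schemes. It assumes rung levels at 1, 2, 4, and 8 epochs and a budget of 10 epochs, as in the description above; the function names are illustrative and are not part of Syne Tune.

```python
def epochs_with_restarts(rung_levels, max_epochs):
    """Total epochs trained when every promotion starts again from scratch."""
    return sum(rung_levels) + max_epochs


def epochs_without_restarts(max_epochs):
    """Total epochs trained when a trial simply keeps running until stopped."""
    return max_epochs


rungs = [1, 2, 4, 8]  # rung levels in epochs, assumed from the text
print(epochs_with_restarts(rungs, 10))  # 25
print(epochs_without_restarts(10))      # 10
```

Under these assumptions, restarting from scratch costs 25 epochs for a trial that the stopping variant trains in 10.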
After the experiment has finished, we can retrieve and plot results. 


d2l.set_figsize()
e = load_experiment(tuner.name)
e.plot()


WARNING:matplotlib.legend:No artists with labels found to put in legend.  Note
↪that artists whose label start with an underscore are ignored when legend()
↪is called with no argument.


[Figure: best result over time for experiment python-entrypoint-2023-08-18-20-01-52-046; x-axis: wallclock time, y-axis: validation_error]


19.5.3 Visualize the Optimization Process 

Once more, we visualize the learning curves of every trial (each color in the plot represents
a trial). Compare this to asynchronous random search in Section 19.3. As we have seen
for successive halving in Section 19.4, most of the trials are stopped at 1 or 2 epochs
($r_{\mathrm{min}}$ or $\eta * r_{\mathrm{min}}$). However, trials do not stop at the same point, because they require
different amounts of time per epoch. If we ran standard successive halving instead of ASHA,
we would need to synchronize our workers before we could promote configurations to the next
rung level.


d2l.set_figsize([6, 2.5])
results = e.results
for trial_id in results.trial_id.unique():
    df = results[results["trial_id"] == trial_id]
    d2l.plt.plot(
        df["st_tuner_time"],
        df["validation_error"],
        marker="o"
    )
d2l.plt.xlabel("wall-clock time")
d2l.plt.ylabel("objective function")


Text(0, 0.5, 'objective function')


19.5.4 Summary


Compared to random search, successive halving is not quite as trivial to run in an asyn- 
chronous distributed setting. To avoid synchronisation points, we promote configurations 
as quickly as possible to the next rung level, even if this means promoting some wrong 
ones. In practice, this usually does not hurt much, and the gains of asynchronous versus 
synchronous scheduling are usually much higher than the loss of the suboptimal decision 
making. 


Discussions


20.1 Generative Adversarial Networks


Throughout most of this book, we have talked about how to make predictions. In some 
form or another, we used deep neural networks to learn mappings from data examples to 
labels. This kind of learning is called discriminative learning, as in, we’d like to be able 
to discriminate between photos of cats and photos of dogs. Classifiers and regressors are 
both examples of discriminative learning. And neural networks trained by backpropaga- 
tion have upended everything we thought we knew about discriminative learning on large 
complicated datasets. Classification accuracies on high-res images have gone from useless 
to human-level (with some caveats) in just 5-6 years. We will spare you another spiel about 
all the other discriminative tasks where deep neural networks do astoundingly well. 


But there is more to machine learning than just solving discriminative tasks. For example, 
given a large dataset, without any labels, we might want to learn a model that concisely 
captures the characteristics of this data. Given such a model, we could sample synthetic 
data examples that resemble the distribution of the training data. For example, given a large 
corpus of photographs of faces, we might want to be able to generate a new photorealistic 
image that looks like it might plausibly have come from the same dataset. This kind of 
learning is called generative modeling. 


Until recently, we had no method that could synthesize novel photorealistic images. But the 
success of deep neural networks for discriminative learning opened up new possibilities. 
One big trend over the last three years has been the application of discriminative deep 
nets to overcome challenges in problems that we do not generally think of as supervised 
learning problems. The recurrent neural network language models are one example of using 
a discriminative network (trained to predict the next character) that once trained can act as 
a generative model. 


In 2014, a breakthrough paper introduced generative adversarial networks (GANs)
(Goodfellow et al., 2014), a clever new way to leverage the power of discriminative models to get
good generative models. At their heart, GANs rely on the idea that a data generator is good
if we cannot tell fake data apart from real data. In statistics, this is called a two-sample test:
a test to answer the question whether datasets $X=\{x_1,\ldots, x_n\}$ and $X'=\{x'_1,\ldots, x'_n\}$
were drawn from the same distribution. The main difference between most statistics papers
and GANs is that the latter use this idea in a constructive way. In other words, rather than
just training a model to say "hey, these two datasets do not look like they came from the
same distribution", they use the two-sample test to provide training signals to a generative
model. This allows us to improve the data generator until it generates something that
resembles the real data. At the very least, it needs to fool the classifier, even if our classifier
is a state-of-the-art deep neural network.


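Fig. 20.1.1: Generative Adversarial Networks. (The generator produces fake data G(z); the discriminator receives fake data G(z) and real data x and predicts whether its input is real or fake.)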


The GAN architecture is illustrated in Fig. 20.1.1. As you can see, there are two pieces
in the GAN architecture. First off, we need a device (say, a deep network, but it really could
be anything, such as a game rendering engine) that might potentially be able to generate
data that looks just like the real thing. If we are dealing with images, this needs to generate
images. If we are dealing with speech, it needs to generate audio sequences, and so on.
We call this the generator network. The second component is the discriminator network. It
attempts to distinguish fake and real data from each other. Both networks are in competition
with each other. The generator network attempts to fool the discriminator network. At that
point, the discriminator network adapts to the new fake data. This information, in turn, is
used to improve the generator network, and so on.


The discriminator is a binary classifier that distinguishes whether the input $\mathbf x$ is real (from
real data) or fake (from the generator). Typically, the discriminator outputs a scalar prediction
$o\in\mathbb R$ for input $\mathbf x$, such as using a fully connected layer with hidden size 1, and
then applies the sigmoid function to obtain the predicted probability $D(\mathbf x) = 1/(1+e^{-o})$.
Assume the label $y$ for the true data is 1 and 0 for the fake data. We train the discriminator to
minimize the cross-entropy loss, i.e.,


$$\min_D \{ - y \log D(\mathbf x) - (1-y)\log(1-D(\mathbf x)) \}. \tag{20.1.1}$$
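
As a small numerical illustration of (20.1.1), the sketch below evaluates the per-example loss for a few raw scores $o$. It uses only the standard library; the helper names `sigmoid` and `discriminator_loss` are illustrative and do not appear elsewhere in this section.

```python
import math


def sigmoid(o):
    """Map a raw score o to a probability D(x) in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-o))


def discriminator_loss(o, y):
    """Cross-entropy for one example with score o and label y (1 real, 0 fake)."""
    d = sigmoid(o)
    return -y * math.log(d) - (1 - y) * math.log(1 - d)


# A confident score on real data (y = 1) gives a small loss ...
print(discriminator_loss(3.0, 1))
# ... while the same score on fake data (y = 0) is penalized heavily.
print(discriminator_loss(3.0, 0))
# An undecided discriminator (o = 0, D(x) = 0.5) pays log 2 for either label.
print(discriminator_loss(0.0, 1), discriminator_loss(0.0, 0), math.log(2))
```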


The generator first draws some parameter $\mathbf z\in\mathbb R^d$ from a source of randomness, e.g.,
a normal distribution $\mathbf z \sim \mathcal{N}(0, 1)$. We often call $\mathbf z$ the latent variable. It then
applies a function to generate $\mathbf x'=G(\mathbf z)$. The goal of the generator is to fool the
discriminator into classifying $\mathbf x'=G(\mathbf z)$ as true data, i.e., we want $D(G(\mathbf z)) \approx 1$.
In other words, for a given discriminator $D$, we update the parameters of the generator $G$ to
maximize the cross-entropy loss when $y=0$, i.e.,


$$\max_G \{ - (1-y) \log(1-D(G(\mathbf z))) \} = \max_G \{ - \log(1-D(G(\mathbf z))) \}. \tag{20.1.2}$$


If the discriminator does a perfect job, then $D(\mathbf x')\approx 0$, so the above loss is near 0, which
results in gradients that are too small for the generator to make good progress. So
commonly, we minimize the following loss:


$$\min_G \{ - y \log(D(G(\mathbf z))) \} = \min_G \{ - \log(D(G(\mathbf z))) \}, \tag{20.1.3}$$


which is just feeding $\mathbf x'=G(\mathbf z)$ into the discriminator but giving label $y=1$.
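
The difference between the two generator objectives can be checked numerically. The sketch below compares, as a function of the discriminator's raw score $o$ for a fake example, the gradient of $\log(1-D)$ (the quantity driven down when maximizing (20.1.2)) with the gradient of $-\log D$ from (20.1.3), where $D = 1/(1+e^{-o})$. The function names are illustrative, and the closed-form derivatives follow from the sigmoid identity $\sigma'(o) = \sigma(o)(1-\sigma(o))$.

```python
import math


def sigmoid(o):
    return 1.0 / (1.0 + math.exp(-o))


def grad_saturating(o):
    """d/do of log(1 - sigmoid(o)), the quantity minimized under (20.1.2)."""
    return -sigmoid(o)


def grad_non_saturating(o):
    """d/do of -log(sigmoid(o)), the loss in (20.1.3)."""
    return sigmoid(o) - 1.0


for o in [-6.0, 0.0, 6.0]:
    print(f"o={o:+.0f}  D={sigmoid(o):.4f}  "
          f"saturating={grad_saturating(o):+.4f}  "
          f"non-saturating={grad_non_saturating(o):+.4f}")
```

When the discriminator confidently rejects a fake example (strongly negative $o$), the first gradient is close to zero while the second stays close to $-1$, so the relabeled loss still provides a useful training signal.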


To sum up, $D$ and $G$ are playing a "minimax" game with the comprehensive objective
function:


$$\min_D \max_G \{ -E_{x \sim \textrm{Data}} \log D(\mathbf x) - E_{z \sim \textrm{Noise}} \log(1 - D(G(\mathbf z))) \}. \tag{20.1.4}$$


Many applications of GANs are in the context of images. For demonstration purposes,
we are going to content ourselves with fitting a much simpler distribution first. We will
illustrate what happens if we use GANs to build the world's most inefficient estimator of
parameters for a Gaussian. Let's get started.


%matplotlib inline
import torch
from torch import nn
from d2l import torch as d2l


20.1.1 Generate Some “Real” Data 


Since this is going to be the world’s lamest example, we simply generate data drawn from 
a Gaussian. 


X = torch.normal(0.0, 1, (1000, 2))
A = torch.tensor([[1, 2], [-0.1, 0.5]])
b = torch.tensor([1, 2])
data = torch.matmul(X, A) + b


Let's see what we got. This should be a Gaussian shifted in some rather arbitrary way with
mean $b$ and covariance matrix $A^TA$.


d2l.set_figsize()
d2l.plt.scatter(data[:100, (0)].detach().numpy(), data[:100, (1)].detach().numpy());
print(f'The covariance matrix is\n{torch.matmul(A.T, A)}')


The covariance matrix is
tensor([[1.0100, 1.9500],
        [1.9500, 4.2500]])


batch_size = 8 
data_iter = d2l.load_array((data,), batch_size)


20.1.2 Generator


Our generator network will be the simplest network possible: a single-layer linear model.
This is because we will be driving that linear network with a Gaussian data generator. Hence,
it literally only needs to learn the parameters to fake things perfectly.


net_G = nn.Sequential(nn.Linear(2, 2)) 


20.1.3 Discriminator 


For the discriminator we will be a bit more discriminating: we will use an MLP with 3 
layers to make things a bit more interesting. 


net_D = nn.Sequential(
    nn.Linear(2, 5), nn.Tanh(),
    nn.Linear(5, 3), nn.Tanh(),
    nn.Linear(3, 1))


20.1.4 Training 


First we define a function to update the discriminator. 


#@save
def update_D(X, Z, net_D, net_G, loss, trainer_D):
    """Update discriminator."""
    batch_size = X.shape[0]
    ones = torch.ones((batch_size,), device=X.device)
    zeros = torch.zeros((batch_size,), device=X.device)
    trainer_D.zero_grad()
    real_Y = net_D(X)
    fake_X = net_G(Z)
    # Do not need to compute gradient for `net_G`, detach it from
    # computing gradients.
    fake_Y = net_D(fake_X.detach())
    loss_D = (loss(real_Y, ones.reshape(real_Y.shape)) +
              loss(fake_Y, zeros.reshape(fake_Y.shape))) / 2
    loss_D.backward()
    trainer_D.step()
    return loss_D


The generator is updated similarly. Here we reuse the cross-entropy loss but change the
label of the fake data from 0 to 1.


#@save
def update_G(Z, net_D, net_G, loss, trainer_G):
    """Update generator."""
    batch_size = Z.shape[0]
    ones = torch.ones((batch_size,), device=Z.device)
    trainer_G.zero_grad()
    # We could reuse `fake_X` from `update_D` to save computation
    fake_X = net_G(Z)
    # Recomputing `fake_Y` is needed since `net_D` is changed
    fake_Y = net_D(fake_X)
    loss_G = loss(fake_Y, ones.reshape(fake_Y.shape))
    loss_G.backward()
    trainer_G.step()
    return loss_G


Both the discriminator and the generator perform a binary logistic regression with the
cross-entropy loss. We use Adam to smooth the training process. In each iteration, we first
update the discriminator and then the generator. We visualize both losses and generated
examples.


def train(net_D, net_G, data_iter, num_epochs, lr_D, lr_G, latent_dim, data):
    loss = nn.BCEWithLogitsLoss(reduction='sum')
    for w in net_D.parameters():
        nn.init.normal_(w, 0, 0.02)
    for w in net_G.parameters():
        nn.init.normal_(w, 0, 0.02)
    trainer_D = torch.optim.Adam(net_D.parameters(), lr=lr_D)
    trainer_G = torch.optim.Adam(net_G.parameters(), lr=lr_G)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[1, num_epochs], nrows=2, figsize=(5, 5),
                            legend=['discriminator', 'generator'])
    animator.fig.subplots_adjust(hspace=0.3)
    for epoch in range(num_epochs):
        # Train one epoch
        timer = d2l.Timer()
        metric = d2l.Accumulator(3)  # loss_D, loss_G, num_examples
        for (X,) in data_iter:
            batch_size = X.shape[0]
            Z = torch.normal(0, 1, size=(batch_size, latent_dim))
            metric.add(update_D(X, Z, net_D, net_G, loss, trainer_D),
                       update_G(Z, net_D, net_G, loss, trainer_G),
                       batch_size)
        # Visualize generated examples
        Z = torch.normal(0, 1, size=(100, latent_dim))
        fake_X = net_G(Z).detach().numpy()
        animator.axes[1].cla()
        animator.axes[1].scatter(data[:, 0], data[:, 1])
        animator.axes[1].scatter(fake_X[:, 0], fake_X[:, 1])
        animator.axes[1].legend(['real', 'generated'])
        # Show the losses
        loss_D, loss_G = metric[0]/metric[2], metric[1]/metric[2]
        animator.add(epoch + 1, (loss_D, loss_G))
    print(f'loss_D {loss_D:.3f}, loss_G {loss_G:.3f}, '
          f'{metric[2] / timer.stop():.1f} examples/sec')


Now we specify the hyperparameters to fit the Gaussian distribution. 


lr_D, lr_G, latent_dim, num_epochs = 0.05, 0.005, 2, 20
train(net_D, net_G, data_iter, num_epochs, lr_D, lr_G,
      latent_dim, data[:100].detach().numpy())


loss_D 0.693, loss_G 0.693, 1020.0 examples/sec 
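
Both reported losses are close to $\log 2 \approx 0.693$. One way to read this, sketched below under the assumption that the discriminator outputs $D(\mathbf x) = 0.5$ for every input, is that it can no longer tell real from generated data. The variable names mirror those printed above; the value `d = 0.5` is an assumption for illustration, not something measured from the trained network.

```python
import math

d = 0.5  # assumed discriminator output when real and fake are indistinguishable

# Per-example discriminator loss: average of the real (y=1) and fake (y=0) terms.
loss_D = (-math.log(d) - math.log(1 - d)) / 2
# Per-example generator loss: fake data scored against the label y=1.
loss_G = -math.log(d)

print(f"loss_D {loss_D:.3f}, loss_G {loss_G:.3f}")
```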


20.1.5 Summary


Generative adversarial networks (GANs) are composed of two deep networks, the generator
and the discriminator.


• The generator generates images as close as possible to the true images to fool
  the discriminator, by maximizing the cross-entropy loss, i.e., $\max \log(D(\mathbf{x'}))$.


• The discriminator tries to distinguish the generated images from the true images, by
  minimizing the cross-entropy loss, i.e., $\min - y \log D(\mathbf{x}) - (1-y)\log(1-D(\mathbf{x}))$.


20.1.6 Exercises 


Does an equilibrium exist where the generator wins, i.e. the discriminator ends up unable 
to distinguish the two distributions on finite samples? 


Discussions


20.2 Deep Convolutional Generative Adversarial Networks


In Section 20.1, we introduced the basic ideas behind how GANs work. We showed that 
they can draw samples from some simple, easy-to-sample distribution, like a uniform or 
normal distribution, and transform them into samples that appear to match the distribution 
of some dataset. And while our example of matching a 2D Gaussian distribution got the 
point across, it is not especially exciting. 


In this section, we will demonstrate how you can use GANs to generate photorealistic
images. We will be basing our models on the deep convolutional GANs (DCGAN) introduced
in Radford et al. (2015). We will borrow the convolutional architecture that has proven
so successful for discriminative computer vision problems and show how, via GANs, it
can be leveraged to generate photorealistic images.


import warnings
import torch
import torchvision
from torch import nn
from d2l import torch as d2l


20.2.1 The Pokemon Dataset 


The dataset we will use is a collection of Pokemon sprites obtained from pokemondb.
First download, extract and load this dataset.


#@save
d2l.DATA_HUB['pokemon'] = (d2l.DATA_URL + 'pokemon.zip',
                           'c065c0e2593b8b161a2d7873e42418bf6a21106c')

data_dir = d2l.download_extract('pokemon')
pokemon = torchvision.datasets.ImageFolder(data_dir)


Downloading ../data/pokemon.zip from http://d2l-data.s3-accelerate.amazonaws.
↪com/pokemon.zip...


We resize each image into 64 x 64. The ToTensor transformation will project the pixel 
value into [0, 1], while our generator will use the tanh function to obtain outputs in [—1, 1]. 
Therefore we normalize the data with 0.5 mean and 0.5 standard deviation to match the 
value range. 


batch_size = 256
transformer = torchvision.transforms.Compose([
    torchvision.transforms.Resize((64, 64)),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(0.5, 0.5)
])
pokemon.transform = transformer
data_iter = torch.utils.data.DataLoader(
    pokemon, batch_size=batch_size,
    shuffle=True, num_workers=d2l.get_dataloader_workers())
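
The effect of normalizing with mean 0.5 and standard deviation 0.5 can be checked on a few scalar pixel values. This is a plain-Python sketch of the per-pixel arithmetic only; the helper names `normalize` and `denormalize` are illustrative and are not torchvision functions.

```python
def normalize(p, mean=0.5, std=0.5):
    """Per-pixel arithmetic of the normalization step: (p - mean) / std."""
    return (p - mean) / std


def denormalize(q):
    """Inverse map back to [0, 1], useful for display: q / 2 + 0.5."""
    return q / 2 + 0.5


for p in [0.0, 0.25, 0.5, 1.0]:
    q = normalize(p)
    print(p, "->", q, "->", denormalize(q))
```

Pixel values 0 and 1 map to -1 and 1, which is the range covered by a tanh output.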


Let’s visualize the first 20 images. 


warnings.filterwarnings('ignore')
d2l.set_figsize((4, 4))
for X, y in data_iter:
    imgs = X[:20,:,:,:].permute(0, 2, 3, 1)/2+0.5
    d2l.show_images(imgs, num_rows=4, num_cols=5)
    break


20.2.2 The Generator


The generator needs to map the noise variable $\mathbf z\in\mathbb R^d$, a length-$d$ vector, to an RGB
image with width and height 64 × 64. In Section 14.11 we introduced the fully convolutional
network that uses transposed convolution layers (refer to Section 14.10) to enlarge the input size.
The basic block of the generator contains a transposed convolution layer followed by
batch normalization and ReLU activation.


class G_block(nn.Module):
    def __init__(self, out_channels, in_channels=3, kernel_size=4, strides=2,
                 padding=1, **kwargs):
        super(G_block, self).__init__(**kwargs)
        self.conv2d_trans = nn.ConvTranspose2d(in_channels, out_channels,
                                kernel_size, strides, padding, bias=False)
        self.batch_norm = nn.BatchNorm2d(out_channels)
        self.activation = nn.ReLU()

    def forward(self, X):
        return self.activation(self.batch_norm(self.conv2d_trans(X)))


By default, the transposed convolution layer uses a $k_h = k_w = 4$ kernel, $s_h = s_w = 2$
strides, and $p_h = p_w = 1$ padding. With an input shape of $n_h \times n_w = 16 \times 16$, the
generator block will double the input's width and height, giving the output shape $n_h' \times n_w'$:


$$
\begin{aligned}
n_h' \times n_w' &= [n_h k_h - (n_h-1)(k_h-s_h) - 2p_h] \times [n_w k_w - (n_w-1)(k_w-s_w) - 2p_w]\\
  &= [k_h + s_h (n_h-1) - 2p_h] \times [k_w + s_w (n_w-1) - 2p_w]\\
  &= [4 + 2 \times (16-1) - 2 \times 1] \times [4 + 2 \times (16-1) - 2 \times 1]\\
  &= 32 \times 32.
\end{aligned}
\tag{20.2.1}
$$


x = torch.zeros((2, 3, 16, 16)) 
g_blk = G_block(20) 
g_blk(x).shape 


torch.Size([2, 20, 32, 32]) 


If we change the transposed convolution layer to a 4 × 4 kernel, 1 × 1 strides, and zero padding,
then with an input size of 1 × 1 the output will have its width and height each increased by 3.


x = torch.zeros((2, 3, 1, 1)) 
g_blk = G_block(20, strides=1, padding=0) 
g_blk(x).shape 


torch.Size([2, 20, 4, 4]) 
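
The two shape checks above follow from the simplified form $k + s(n-1) - 2p$ of (20.2.1). The helper below is a plain-Python sketch of that formula for one spatial dimension; it assumes no dilation and no output padding, and the name `conv_transpose_out` is illustrative.

```python
def conv_transpose_out(n, k=4, s=2, p=1):
    """Output size of a transposed convolution along one dimension,
    assuming dilation 1 and no output padding."""
    return k + s * (n - 1) - 2 * p


print(conv_transpose_out(16))           # 32
print(conv_transpose_out(1, s=1, p=0))  # 4

# Chaining one (s=1, p=0) layer with four default (s=2, p=1) layers.
size = conv_transpose_out(1, s=1, p=0)
sizes = [size]
for _ in range(4):
    size = conv_transpose_out(size)
    sizes.append(size)
print(sizes)  # [4, 8, 16, 32, 64]
```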


The generator consists of four basic blocks that increase the input's width and height from
1 to 32. At the same time, it first projects the latent variable into 64 × 8 channels, and then
halves the channels each time. Finally, a transposed convolution layer is used to generate
the output. It further doubles the width and height to match the desired 64 × 64 shape,
and reduces the channel size to 3. The tanh activation function is applied to project output
values into the (-1, 1) range.


n_G = 64
net_G = nn.Sequential(
    G_block(in_channels=100, out_channels=n_G*8,
            strides=1, padding=0),                  # Output: (64 * 8, 4, 4)
    G_block(in_channels=n_G*8, out_channels=n_G*4), # Output: (64 * 4, 8, 8)
    G_block(in_channels=n_G*4, out_channels=n_G*2), # Output: (64 * 2, 16, 16)
    G_block(in_channels=n_G*2, out_channels=n_G),   # Output: (64, 32, 32)
    nn.ConvTranspose2d(in_channels=n_G, out_channels=3,
                       kernel_size=4, stride=2, padding=1, bias=False),
    nn.Tanh())  # Output: (3, 64, 64)


Generate a 100 dimensional latent variable to verify the generator’s output shape. 


x = torch.zeros((1, 100, 1, 1)) 
net_G(x).shape 


torch.Size([1, 3, 64, 64]) 


20.2.3 Discriminator 


The discriminator is a normal convolutional network except that it uses a leaky
ReLU as its activation function. Given $\alpha \in[0, 1]$, its definition is


$$\textrm{leaky ReLU}(x) = \begin{cases}x & \textrm{if}\ x > 0\\ \alpha x &\textrm{otherwise}\end{cases}. \tag{20.2.2}$$


As can be seen, it is the normal ReLU if $\alpha=0$, and an identity function if $\alpha=1$. For
$\alpha \in (0, 1)$, leaky ReLU is a nonlinear function that gives a non-zero output for a negative
input. It aims to fix the "dying ReLU" problem, in which a neuron might always output a negative
value and therefore cannot make any progress since the gradient of ReLU is 0.
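
The definition can be written directly as a scalar function. The sketch below is plain Python rather than the `nn.LeakyReLU` layer, and the name `leaky_relu` is illustrative; it shows the behavior for the two limiting values of the slope and for one value in between.

```python
def leaky_relu(x, alpha):
    """Return x for positive inputs and alpha * x otherwise."""
    return x if x > 0 else alpha * x


inputs = [-2.0, -0.5, 0.5, 1.0]
for alpha in [0.0, 0.2, 1.0]:
    # alpha = 0 reproduces ReLU, alpha = 1 the identity function.
    print(alpha, [leaky_relu(x, alpha) for x in inputs])
```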


alphas = [0, .2, .4, .6, .8, 1]
x = torch.arange(-2, 1, 0.1)
Y = [nn.LeakyReLU(alpha)(x).detach().numpy() for alpha in alphas]
d2l.plot(x.detach().numpy(), Y, 'x', 'y', alphas)


The basic block of the discriminator is a convolution layer followed by a batch normalization
layer and a leaky ReLU activation. The hyperparameters of the convolution layer are similar
to those of the transposed convolution layer in the generator block.


class D_block(nn.Module):
    def __init__(self, out_channels, in_channels=3, kernel_size=4, strides=2,
                 padding=1, alpha=0.2, **kwargs):
        super(D_block, self).__init__(**kwargs)
        self.conv2d = nn.Conv2d(in_channels, out_channels, kernel_size,
                                strides, padding, bias=False)
        self.batch_norm = nn.BatchNorm2d(out_channels)
        self.activation = nn.LeakyReLU(alpha, inplace=True)

    def forward(self, X):
        return self.activation(self.batch_norm(self.conv2d(X)))


A basic block with default settings will halve the width and height of the inputs, as we demonstrated in Section 7.3. For example, given an input shape $n_h = n_w = 16$, with a kernel shape $k_h = k_w = 4$, a stride shape $s_h = s_w = 2$, and a padding shape $p_h = p_w = 1$, the output shape will be:


$$\begin{aligned} n_h' \times n_w' &= \lfloor(n_h - k_h + 2p_h + s_h)/s_h\rfloor \times \lfloor(n_w - k_w + 2p_w + s_w)/s_w\rfloor \\ &= \lfloor(16 - 4 + 2 \times 1 + 2)/2\rfloor \times \lfloor(16 - 4 + 2 \times 1 + 2)/2\rfloor \\ &= 8 \times 8. \end{aligned} \tag{20.2.3}$$


x = torch.zeros((2, 3, 16, 16))
d_blk = D_block(20)
d_blk(x).shape


torch.Size([2, 20, 8, 8]) 
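

The same shape arithmetic can be traced through a whole stack of blocks. The sketch below is plain Python and only evaluates the output-size formula; the helper name `conv_out_size` is an assumption introduced for illustration, and `p` is taken to be the padding added on each side, as in the formula above.

```python
def conv_out_size(n, k, s, p):
    """Evaluate floor((n - k + 2p + s) / s) for one spatial dimension."""
    return (n - k + 2 * p + s) // s

# One block with the default settings (k=4, s=2, p=1) halves a 16-pixel side.
single = conv_out_size(16, k=4, s=2, p=1)

# Four such blocks applied to a 64-pixel side.
sizes = [64]
for _ in range(4):
    sizes.append(conv_out_size(sizes[-1], k=4, s=2, p=1))

# A final 4x4 convolution with stride 1 and no padding.
final = conv_out_size(sizes[-1], k=4, s=1, p=0)

print(single, sizes, final)
```

Under these assumptions a 64 × 64 input shrinks to 4 × 4 after four blocks, and a final 4 × 4 convolution reduces it to a single value per example.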


The discriminator is a mirror of the generator. 


n_D = 64
net_D = nn.Sequential(
    D_block(n_D),  # Output: (64, 32, 32)
    D_block(in_channels=n_D, out_channels=n_D*2),  # Output: (64 * 2, 16, 16)
    D_block(in_channels=n_D*2, out_channels=n_D*4),  # Output: (64 * 4, 8, 8)
    D_block(in_channels=n_D*4, out_channels=n_D*8),  # Output: (64 * 8, 4, 4)
    nn.Conv2d(in_channels=n_D*8, out_channels=1,
              kernel_size=4, bias=False))  # Output: (1, 1, 1)


It uses a convolution layer with output channel 1 as the last layer to obtain a single prediction value.


x = torch.zeros((1, 3, 64, 64)) 
net_D(x).shape 


torch.Size([1, 1, 1, 1]) 


20.2.4 Training


Compared to the basic GAN in Section 20.1, we use the same learning rate for both the generator and the discriminator since they are similar to each other. In addition, we change $\beta_1$ in Adam (Section 12.10) from 0.9 to 0.5. This decreases the smoothness of the momentum, the exponentially weighted moving average of past gradients, to cope with the rapidly changing gradients that arise because the generator and the discriminator fight with each other. Besides, the randomly generated noise Z is a 4-D tensor, and we use a GPU to accelerate the computation.
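

To see why a smaller momentum coefficient reacts faster, consider a toy gradient sequence that abruptly flips sign. The sketch below is plain Python and computes only the exponentially weighted moving average; it deliberately omits Adam's bias correction and second-moment scaling, and the gradient values are invented for illustration.

```python
def ema(grads, beta):
    """Moving average m_t = beta * m_{t-1} + (1 - beta) * g_t, with m_0 = 0."""
    m, history = 0.0, []
    for g in grads:
        m = beta * m + (1 - beta) * g
        history.append(m)
    return history

# Ten steps of gradient +1 followed by three steps of gradient -1.
grads = [1.0] * 10 + [-1.0] * 3

slow = ema(grads, beta=0.9)  # heavily smoothed
fast = ema(grads, beta=0.5)  # tracks recent gradients more closely

print(f"beta=0.9: {slow[-1]:.3f}, beta=0.5: {fast[-1]:.3f}")
```

Three steps after the flip, the average with coefficient 0.5 has already changed sign, while the average with coefficient 0.9 still points in the old direction.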


def train(net_D, net_G, data_iter, num_epochs, lr, latent_dim,
          device=d2l.try_gpu()):
    loss = nn.BCEWithLogitsLoss(reduction='sum')
    for w in net_D.parameters():
        nn.init.normal_(w, 0, 0.02)
    for w in net_G.parameters():
        nn.init.normal_(w, 0, 0.02)
    net_D, net_G = net_D.to(device), net_G.to(device)
    trainer_hp = {'lr': lr, 'betas': [0.5, 0.999]}
    trainer_D = torch.optim.Adam(net_D.parameters(), **trainer_hp)
    trainer_G = torch.optim.Adam(net_G.parameters(), **trainer_hp)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[1, num_epochs], nrows=2, figsize=(5, 5),
                            legend=['discriminator', 'generator'])
    animator.fig.subplots_adjust(hspace=0.3)
    for epoch in range(1, num_epochs + 1):
        # Train one epoch
        timer = d2l.Timer()
        metric = d2l.Accumulator(3)  # loss_D, loss_G, num_examples
        for X, _ in data_iter:
            batch_size = X.shape[0]
            Z = torch.normal(0, 1, size=(batch_size, latent_dim, 1, 1))
            X, Z = X.to(device), Z.to(device)
            metric.add(d2l.update_D(X, Z, net_D, net_G, loss, trainer_D),
                       d2l.update_G(Z, net_D, net_G, loss, trainer_G),
                       batch_size)
        # Show generated examples
        Z = torch.normal(0, 1, size=(21, latent_dim, 1, 1), device=device)
        # Rescale the generator's Tanh output from [-1, 1] to [0, 1] for display
        fake_x = net_G(Z).permute(0, 2, 3, 1) / 2 + 0.5
        imgs = torch.cat(
            [torch.cat([
                fake_x[i * 7 + j].cpu().detach() for j in range(7)], dim=1)
             for i in range(len(fake_x)//7)], dim=0)
        animator.axes[1].cla()
        animator.axes[1].imshow(imgs)
        # Show the losses
        loss_D, loss_G = metric[0] / metric[2], metric[1] / metric[2]
        animator.add(epoch, (loss_D, loss_G))
    print(f'loss_D {loss_D:.3f}, loss_G {loss_G:.3f}, '
          f'{metric[2] / timer.stop():.1f} examples/sec on {str(device)}')


We train the model with a small number of epochs just for demonstration. For better performance, the variable num_epochs can be set to a larger number.


latent_dim, lr, num_epochs = 100, 0.005, 20
train(net_D, net_G, data_iter, num_epochs, lr, latent_dim)


loss_D 0.023, loss_G 7.359, 2292.7 examples/sec on cuda:0


20.2.5 Summary


• The DCGAN architecture has four convolutional layers for the discriminator and four "fractionally-strided" convolutional layers for the generator.


• The discriminator consists of four strided convolution layers with batch normalization (except its input layer) and leaky ReLU activations.


• Leaky ReLU is a nonlinear function that gives a non-zero output for a negative input. It aims to fix the "dying ReLU" problem and helps the gradients flow more easily through the architecture.


20.2.6 Exercises 
1. What will happen if we use standard ReLU activation rather than leaky ReLU? 


2. Apply DCGAN on Fashion-MNIST and see which category works well and which does 
not. 


Shuai Zhang (Amazon), Aston Zhang (Amazon), and Yi Tay (Google)


Recommender systems are widely employed in industry and are ubiquitous in our daily lives. These systems are used in a number of areas, such as online shopping sites (e.g., amazon.com), music and movie services (e.g., Netflix and Spotify), mobile application stores (e.g., the iOS App Store and Google Play), and online advertising, to name just a few.


The major goal of recommender systems is to help users discover relevant items such as movies to watch, text to read, or products to buy, so as to create a delightful user experience. Moreover, recommender systems are among the most powerful machine learning systems that online retailers implement in order to drive incremental revenue. Recommender systems can stand in for search engines by reducing the effort of proactive searches and surprising users with offers they never searched for. Many companies have managed to position themselves ahead of their competitors with the help of more effective recommender systems. As such, recommender systems are not only central to our everyday lives but also indispensable in some industries.


In this chapter, we will cover the fundamentals of recommender systems and recent advances in the field, and explore some common techniques for building recommender systems from different data sources, together with their implementations. Specifically, you will learn how to predict the rating a user might give to a prospective item, how to generate a recommendation list of items, and how to predict the click-through rate from abundant features. These tasks are commonplace in real-world applications. By studying this chapter, you will get hands-on experience in solving real-world recommendation problems with not only classical methods but also more advanced deep learning based models.


21.1 Overview of Recommender Systems 


In the last decade, the Internet has evolved into a platform for large-scale online services, which has profoundly changed the way we communicate, read news, buy products, and watch movies. Meanwhile, the unprecedented number of items offered online (we use the term item to refer to movies, news, books, and products) requires a system that can help us discover the items that we prefer. Recommender systems are therefore powerful information filtering tools that can facilitate personalized services and provide tailored experiences to individual users. In short, recommender systems play a pivotal role in utilizing the wealth of data available to make choices manageable. Nowadays, recommender systems are at the core of a number of online service providers such as Amazon, Netflix, and YouTube. Recall the example of deep learning books recommended by Amazon in Fig. 1.3.3. The benefits of employing recommender systems are twofold: on the one hand, they can greatly reduce users' effort in finding items and alleviate the issue of information overload; on the other hand, they can add business value to online service providers and are an important source of revenue. This chapter will introduce the fundamental concepts, classic models, and recent advances with deep learning in the field of recommender systems, together with implemented examples.


Illustration of the recommendation process (figure labels: application, user feedback, recommendation list).


21.1.1 Collaborative Filtering 


We start the journey with an important concept in recommender systems: collaborative filtering (CF). The term was first coined in the Tapestry system (Goldberg et al., 1992), referring to "people collaborate to help one another perform the filtering process in order to handle the large amounts of email and messages posted to newsgroups". The term has since taken on broader senses. In a broad sense, it is the process of filtering for information or patterns using techniques involving collaboration among multiple users, agents, and data sources. CF takes many forms, and numerous CF methods have been proposed since its advent.


Overall, CF techniques can be categorized into memory-based CF, model-based CF, and their hybrid (Su and Khoshgoftaar, 2009). Representative memory-based CF techniques are nearest neighbor-based CF such as user-based CF and item-based CF (Sarwar et al., 2001). Latent factor models such as matrix factorization are examples of model-based CF. Memory-based CF has limitations in dealing with sparse and large-scale data since it computes the similarity values based on common items. Model-based methods have become more popular owing to their better capability in dealing with sparsity and scalability. Many model-based CF approaches can be extended with neural networks, leading to more flexible and scalable models with the computation acceleration in deep learning (Zhang et al., 2019). In general, CF only uses the user-item interaction data to make predictions and recommendations. Besides CF, content-based and context-based recommender systems are also useful in incorporating the content descriptions of items/users and contextual signals such as timestamps and locations. Obviously, we may need to adjust the model types/structures when different input data is available.
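

The sparsity limitation of memory-based CF can be made concrete with a toy example. The sketch below is plain Python and computes a user-user cosine similarity restricted to commonly rated items; the rating data, the user names, and the helper `cosine_on_common` are all hypothetical, and this is only one of several similarity measures used in practice.

```python
import math

# Hypothetical toy data: user -> {item: rating}.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 2, "D": 5},
    "carol": {"C": 1, "D": 2},
    "dave": {"E": 3},
}

def cosine_on_common(u, v):
    """Cosine similarity of two users over the items both have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0  # no overlap: the similarity is undefined, report 0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

sim_ab = cosine_on_common("alice", "bob")    # two items in common
sim_ac = cosine_on_common("alice", "carol")  # one item in common
sim_ad = cosine_on_common("alice", "dave")   # nothing in common

print(sim_ab, sim_ac, sim_ad)
```

With a single common item the cosine is trivially 1 even though the two ratings disagree (4 versus 1), and with no common item nothing can be said at all. On sparse data most user pairs fall into one of these two cases, which is the limitation described above.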


21.1.2 Explicit Feedback and Implicit Feedback 


To learn the preferences of users, the system must collect feedback from them. The feedback can be either explicit or implicit (Hu et al., 2008). For example, IMDb collects star ratings ranging from one to ten stars for movies. YouTube provides thumbs-up and thumbs-down buttons for users to show their preferences. It is apparent that gathering explicit feedback requires users to indicate their interests proactively. However, explicit feedback is not always readily available, as many users may be reluctant to rate products. By comparison, implicit feedback is often readily available since it mainly concerns implicit behavior such as user clicks. As such, many recommender systems are centered on implicit feedback, which indirectly reflects users' opinions through observed behavior. There are diverse forms of implicit feedback, including purchase history, browsing history, watches, and even mouse movements. For example, a user who purchased many books by the same author probably likes that author. Note that implicit feedback is inherently noisy: we can only guess users' preferences and true motives. The fact that a user watched a movie does not necessarily indicate a positive view of that movie.


21.1.3 Recommendation Tasks 


A number of recommendation tasks have been investigated in the past decades. Based on the domain of application, there are movie recommendation, news recommendation, point-of-interest recommendation (Ye et al., 2011), and so forth. It is also possible to differentiate the tasks based on the types of feedback and input data. For example, the rating prediction task aims to predict explicit ratings. Top-n recommendation (item ranking) produces a personalized ranking of all items for each user based on implicit feedback. If timestamp information is also included, we can build sequence-aware recommendation (Quadrana et al., 2018). Another popular task is click-through rate prediction, which is also based on implicit feedback, but can make use of various categorical features. Recommending for new users and recommending new items to existing users are called cold-start recommendation (Schein et al., 2002).


21.1.4 Summary 


• Recommender systems are important for individual users and industries. Collaborative filtering is a key concept in recommendation.


• There are two types of feedback: implicit feedback and explicit feedback. A number of recommendation tasks have been explored during the last decade.


21.1.5 Exercises 
1. Can you explain how recommender systems influence your daily life? 


2. What interesting recommendation tasks do you think can be investigated?


Appendix: Mathematics for Deep Learning


Brent Werness (Amazon), Rachel Hu (Amazon), and authors of this book 


One of the wonderful parts of modern deep learning is the fact that much of it can be 
understood and used without a full understanding of the mathematics below it. This is a 
sign that the field is maturing. Just as most software developers no longer need to worry 
about the theory of computable functions, neither should deep learning practitioners need 
to worry about the theoretical foundations of maximum likelihood learning. 


But, we are not quite there yet. 


In practice, you will sometimes need to understand how architectural choices influence 
gradient flow, or the implicit assumptions you make by training with a certain loss function. 
You might need to know what in the world entropy measures, and how it can help you 
understand exactly what bits-per-character means in your model. These all require deeper 
mathematical understanding. 


This appendix aims to provide you with the mathematical background you need to understand the core theory of modern deep learning, but it is not exhaustive. We will begin by examining linear algebra in greater depth. We develop a geometric understanding of all the common linear algebraic objects and operations that will enable us to visualize the effects of various transformations on our data. A key element is the development of the basics of eigen-decompositions.


We next develop the theory of differential calculus to the point that we can fully understand 
why the gradient is the direction of steepest descent, and why back-propagation takes the 
form it does. Integral calculus is then discussed to the degree needed to support our next 
topic, probability theory. 


Problems encountered in practice frequently involve uncertainty, and thus we need a language to speak about uncertain things. We review the theory of random variables and the most commonly encountered distributions so we may discuss models probabilistically. This provides the foundation for the naive Bayes classifier, a probabilistic classification technique.


Closely related to probability theory is the study of statistics. While statistics is far too 
large a field to do justice in a short section, we will introduce fundamental concepts that all 
machine learning practitioners should be aware of, in particular: evaluating and comparing 
estimators, conducting hypothesis tests, and constructing confidence intervals. 


Last, we turn to the topic of information theory, which is the mathematical study of information storage and transmission. This provides the core language by which we may discuss quantitatively how much information a model holds on a domain of discourse.


Taken together, these form the core of the mathematical concepts needed to begin down the 
path towards a deep understanding of deep learning. 


A.1 Geometry and Linear Algebraic Operations 


In Section 2.3, we encountered the basics of linear algebra and saw how it could be used to express common operations for transforming our data. Linear algebra is one of the key mathematical pillars underlying much of the work that we do in deep learning and in machine learning more broadly. While Section 2.3 contained enough machinery to communicate the mechanics of modern deep learning models, there is a lot more to the subject. In this section, we will go deeper, highlighting some geometric interpretations of linear algebra operations, and introducing a few fundamental concepts, including eigenvalues and eigenvectors.


A.1.1 Geometry of Vectors 


First, we need to discuss the two common geometric interpretations of vectors, as either 
points or directions in space. Fundamentally, a vector is a list of numbers such as the 
Python list below. 


v = [1, 7, 0, 1]


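Mathematicians most often write this as either a column or row vector, which is to say either as


$$\mathbf{x} = \begin{bmatrix} 1 \\ 7 \\ 0 \\ 1 \end{bmatrix}, \tag{A.1}$$


or


$$\mathbf{x}^\top = \begin{bmatrix} 1 & 7 & 0 & 1 \end{bmatrix}. \tag{A.2}$$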


These often have different interpretations, where data examples are column vectors and 
weights used to form weighted sums are row vectors. However, it can be beneficial to be 
flexible. As we have described in Section 2.3, though a single vector’s default orientation is 
a column vector, for any matrix representing a tabular dataset, treating each data example 
as a row vector in the matrix is more conventional. 


Given a vector, the first interpretation that we should give it is as a point in space. In two or three dimensions, we can visualize these points by using the components of the vectors to define the location of the points in space compared to a fixed reference called the origin. This can be seen in Fig. A.1.




An illustration of visualizing vectors as points in the plane. The first component of the vector gives the x-coordinate, the second component gives the y-coordinate. Higher dimensions are analogous, although much harder to visualize.


This geometric point of view allows us to consider the problem on a more abstract level. No 
longer faced with some insurmountable seeming problem like classifying pictures as either 
cats or dogs, we can start considering tasks abstractly as collections of points in space and 
picturing the task as discovering how to separate two distinct clusters of points. 


In parallel, there is a second point of view that people often take of vectors: as directions in space. Not only can we think of the vector $\mathbf{v} = [3, 2]^\top$ as the location 3 units to the right and 2 units up from the origin, we can also think of it as the direction itself to take 3 steps to the right and 2 steps up. In this way, we consider all the vectors in Fig. A.2 the same.


Any vector can be visualized as an arrow in the plane. In this case, every vector drawn is a representation of the vector $[3, 2]^\top$.


One of the benefits of this shift is that we can make visual sense of the act of vector addition. 
In particular, we follow the directions given by one vector, and then follow the directions 
given by the other, as is seen in Fig. A.3. 


We can visualize vector addition by first following one vector, and then another.


Vector subtraction has a similar interpretation. By considering the identity $\mathbf{u} = \mathbf{v} + (\mathbf{u} - \mathbf{v})$, we see that the vector $\mathbf{u} - \mathbf{v}$ is the direction that takes us from the point $\mathbf{v}$ to the point $\mathbf{u}$.


A.1.2 Dot Products and Angles


As we saw in Section 2.3, if we take two column vectors $\mathbf{u}$ and $\mathbf{v}$, we can form their dot product by computing:


$$\mathbf{u}^\top \mathbf{v} = \sum_i u_i \cdot v_i. \tag{A.3}$$


Because (A.3) is symmetric, we will mirror the notation of classical multiplication and write


$$\mathbf{u} \cdot \mathbf{v} = \mathbf{u}^\top \mathbf{v} = \mathbf{v}^\top \mathbf{u}, \tag{A.4}$$


to highlight the fact that exchanging the order of the vectors will yield the same answer.


The dot product (A.3) also admits a geometric interpretation: it is closely related to the 
angle between two vectors. Consider the angle shown in Fig. A.4. 


Between any two vectors in the plane there is a well-defined angle $\theta$. We will see this angle is intimately tied to the dot product.


To start, let's consider two specific vectors:


$$\mathbf{v} = (r, 0) \textrm{ and } \mathbf{w} = (s\cos(\theta), s\sin(\theta)). \tag{A.5}$$


The vector $\mathbf{v}$ has length $r$ and runs parallel to the x-axis, and the vector $\mathbf{w}$ has length $s$ and is at angle $\theta$ with the x-axis. If we compute the dot product of these two vectors, we see that


$$\mathbf{v} \cdot \mathbf{w} = rs\cos(\theta) = \|\mathbf{v}\|\|\mathbf{w}\|\cos(\theta). \tag{A.6}$$


With some simple algebraic manipulation, we can rearrange terms to obtain


$$\theta = \arccos\left(\frac{\mathbf{v} \cdot \mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|}\right). \tag{A.7}$$


In short, for these two specific vectors, the dot product combined with the norms tells us the angle between the two vectors. This same fact is true in general. We will not derive the expression here; however, if we consider writing $\|\mathbf{v} - \mathbf{w}\|^2$ in two ways, one with the dot product and the other geometrically using the law of cosines, we can obtain the full relationship. Indeed, for any two vectors $\mathbf{v}$ and $\mathbf{w}$, the angle between the two vectors is


$$\theta = \arccos\left(\frac{\mathbf{v} \cdot \mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|}\right). \tag{A.8}$$


This is a nice result since nothing in the computation references two dimensions. Indeed, we can use this in three or three million dimensions without issue.
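

For readers who want to see the two ways of writing the squared distance side by side, the following is a brief sketch of the argument hinted at above. It assumes the triangle with sides $\mathbf{v}$, $\mathbf{w}$, and $\mathbf{v} - \mathbf{w}$, with $\theta$ the angle between $\mathbf{v}$ and $\mathbf{w}$.

```latex
\begin{aligned}
% Expanding the squared norm with the dot product:
\|\mathbf{v} - \mathbf{w}\|^2
  &= (\mathbf{v} - \mathbf{w}) \cdot (\mathbf{v} - \mathbf{w})
   = \|\mathbf{v}\|^2 - 2\,\mathbf{v} \cdot \mathbf{w} + \|\mathbf{w}\|^2, \\
% The law of cosines for the same triangle:
\|\mathbf{v} - \mathbf{w}\|^2
  &= \|\mathbf{v}\|^2 + \|\mathbf{w}\|^2
     - 2\,\|\mathbf{v}\|\,\|\mathbf{w}\|\cos(\theta).
\end{aligned}
% Equating the two right-hand sides and cancelling the common terms gives
% \mathbf{v} \cdot \mathbf{w} = \|\mathbf{v}\|\,\|\mathbf{w}\|\cos(\theta).
```

Equating the two expressions and cancelling $\|\mathbf{v}\|^2 + \|\mathbf{w}\|^2$ leaves $\mathbf{v} \cdot \mathbf{w} = \|\mathbf{v}\|\|\mathbf{w}\|\cos(\theta)$, which rearranges to the arccosine formula for nonzero vectors.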


As a simple example, let’s see how to compute the angle between a pair of vectors: 


%matplotlib inline
import torch
import torchvision
from IPython import display
from torchvision import transforms
from d2l import torch as d2l


def angle(v, w):
    return torch.acos(v.dot(w) / (torch.norm(v) * torch.norm(w)))

angle(torch.tensor([0, 1, 2], dtype=torch.float32), torch.tensor([2.0, 3, 4]))


tensor(0.4190)


We will not use it right now, but it is useful to know that we refer to vectors for which the angle is $\pi/2$ (or equivalently $90^\circ$) as being orthogonal. By examining the equation above, we see that this happens when $\theta = \pi/2$, which is the same thing as $\cos(\theta) = 0$. The only way this can happen is if the dot product itself is zero, so two vectors are orthogonal if and only if $\mathbf{v} \cdot \mathbf{w} = 0$. This will prove to be a helpful formula when understanding objects geometrically.


It is reasonable to ask: why is computing the angle useful? The answer comes in the kind of invariance we expect data to have. Consider an image, and a duplicate image, where every pixel value is the same but at 10% of the brightness. The values of the individual pixels are in general far from the original values. Thus, if one computed the distance between the original image and the darker one, the distance can be large. However, for most ML applications, the content is the same: it is still an image of a cat as far as a cat/dog classifier is concerned. If we instead consider the angle, it is not hard to see that for any vector $\mathbf{v}$, the angle between $\mathbf{v}$ and $0.1 \cdot \mathbf{v}$ is zero. This corresponds to the fact that scaling vectors keeps the same direction and just changes the length. The angle considers the darker image identical.
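

The brightness argument can be checked numerically. The sketch below is plain Python; the four pixel values are invented for illustration, and the cosine is clamped to [-1, 1] before taking the arccosine to guard against floating-point rounding.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def distance(a, b):
    return norm([x - y for x, y in zip(a, b)])

def angle(a, b):
    cosine = dot(a, b) / (norm(a) * norm(b))
    return math.acos(max(-1.0, min(1.0, cosine)))  # clamp rounding error

image = [200.0, 120.0, 30.0, 90.0]  # hypothetical pixel values
darker = [0.1 * p for p in image]   # same image at 10% brightness

dist = distance(image, darker)
theta = angle(image, darker)

print(f"distance = {dist:.1f}, angle = {theta:.6f}")
```

The distance is 90% of the original image's norm, which is large, while the angle is zero up to rounding.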


Examples like this are everywhere. In text, we might want the topic being discussed to not change if we write a document twice as long that says the same thing. For some encodings (such as counting the number of occurrences of words in some vocabulary), this corresponds to a doubling of the vector encoding the document, so again we can use the angle.


Cosine Similarity


In ML contexts where the angle is employed to measure the closeness of two vectors, practitioners adopt the term cosine similarity to refer to the quantity


$$\cos(\theta) = \frac{\mathbf{v} \cdot \mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|}. \tag{A.9}$$


The cosine takes a maximum value of 1 when the two vectors point in the same direction, a minimum value of −1 when they point in opposite directions, and a value of 0 when the two vectors are orthogonal. Note that if the components of high-dimensional vectors are sampled randomly with mean 0, their cosine will nearly always be close to 0.
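

The claim about random high-dimensional vectors is easy to probe empirically. The sketch below is plain Python; the dimension, the seed, and the choice of independent standard normal components are assumptions made for illustration, and a single draw is of course not a proof.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

random.seed(0)  # fixed seed so the run is repeatable
dim = 10_000    # illustrative dimension

# Two independent vectors with zero-mean Gaussian components.
u = [random.gauss(0, 1) for _ in range(dim)]
v = [random.gauss(0, 1) for _ in range(dim)]

cos_uv = cosine(u, v)
cos_uu = cosine(u, u)
cos_neg = cosine(u, [-x for x in u])

print(cos_uv, cos_uu, cos_neg)
```

For independent components of this kind, the cosine typically has magnitude on the order of one over the square root of the dimension, so it shrinks as the dimension grows.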


A.1.3  Hyperplanes 


In addition to working with vectors, another key object that you must understand to go far in linear algebra is the hyperplane, a generalization to higher dimensions of a line (two dimensions) or of a plane (three dimensions). In a d-dimensional vector space, a hyperplane has d − 1 dimensions and divides the space into two half-spaces.


Let's start with an example. Suppose that we have a column vector $\mathbf{w} = [2, 1]^\top$. We want to know, "what are the points $\mathbf{v}$ with $\mathbf{w} \cdot \mathbf{v} = 1$?" By recalling the connection between dot products and angles above (A.8), we can see that this is equivalent to


$$\|\mathbf{v}\|\|\mathbf{w}\|\cos(\theta) = 1 \iff \|\mathbf{v}\|\cos(\theta) = \frac{1}{\|\mathbf{w}\|} = \frac{1}{\sqrt{5}}. \tag{A.10}$$


Recalling trigonometry, we see the formula $\|\mathbf{v}\|\cos(\theta)$ is the length of the projection of the vector $\mathbf{v}$ onto the direction of $\mathbf{w}$.


If we consider the geometric meaning of this expression, we see that this is equivalent to saying that the length of the projection of $\mathbf{v}$ onto the direction of $\mathbf{w}$ is exactly $1/\|\mathbf{w}\|$, as is shown in Fig. A.5. The set of all points where this is true is a line at right angles to the vector $\mathbf{w}$. If we wanted, we could find the equation for this line and see that it is $2x + y = 1$ or equivalently $y = 1 - 2x$.


If we now look at what happens when we ask about the set of points with $\mathbf{w} \cdot \mathbf{v} > 1$ or $\mathbf{w} \cdot \mathbf{v} < 1$, we can see that these are cases where the projections are longer or shorter than $1/\|\mathbf{w}\|$, respectively. Thus, those two inequalities define either side of the line. In this way, we have found a way to cut our space into two halves, where all the points on one side have dot product below a threshold, and the other side above, as we see in Fig. A.6.


If we now consider the inequality version of the expression, we see that our hyperplane (in this case: just a line) separates the space into two halves.
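

The two-dimensional example can be verified point by point. The sketch below is plain Python for the vector w = [2, 1] and the threshold 1 used in the text; the helper names and the sample points are illustrative assumptions.

```python
import math

w = (2.0, 1.0)
threshold = 1.0

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def side(v):
    """Report which side of the line w . v = threshold the point v lies on."""
    score = dot(w, v)
    if score > threshold:
        return "above"
    if score < threshold:
        return "below"
    return "on"

def projection_length(v):
    """Signed length of the projection of v onto the direction of w."""
    return dot(w, v) / math.sqrt(dot(w, w))

on_line = [(0.0, 1.0), (0.5, 0.0)]  # both satisfy 2x + y = 1
labels = [side(p) for p in on_line + [(1.0, 1.0), (0.0, 0.0)]]
lengths = [projection_length(p) for p in on_line]

print(labels)
print(lengths)
```

Both points on the line have the same projection length, one over the square root of 5, even though they are different points. That is why the solution set is a line perpendicular to w.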


The story in higher dimensions is much the same. If we now take $\mathbf{w} = [1, 2, 3]^\top$ and ask about the points in three dimensions with $\mathbf{w} \cdot \mathbf{v} = 1$, we obtain a plane at right angles to the given vector $\mathbf{w}$. The two inequalities again define the two sides of the plane, as is shown in Fig. A.7.




Hyperplanes in any dimension separate the space into two halves. 


While our ability to visualize runs out at this point, nothing stops us from doing this in 
tens, hundreds, or billions of dimensions. This occurs often when thinking about machine 
learned models. For instance, we can understand linear classification models like those 
from Section 4.1, as methods to find hyperplanes that separate the different target classes. 
In this context, such hyperplanes are often referred to as decision planes. The majority of 
deep learned classification models end with a linear layer fed into a softmax, so one can 
interpret the role of the deep neural network to be to find a non-linear embedding such that 
the target classes can be separated cleanly by hyperplanes. 


To give a hand-built example, notice that we can produce a reasonable model to classify tiny images of t-shirts and trousers from the Fashion-MNIST dataset (seen in Section 4.2) by just taking the vector between their means to define the decision plane and eyeballing a crude threshold. First we will load the data and compute the averages.


# Load in the dataset
trans = []
trans.append(transforms.ToTensor())
trans = transforms.Compose(trans)
train = torchvision.datasets.FashionMNIST(root="../data", transform=trans,
                                          train=True, download=True)
test = torchvision.datasets.FashionMNIST(root="../data", transform=trans,
                                         train=False, download=True)

X_train_0 = torch.stack(
    [x[0] * 256 for x in train if x[1] == 0]).type(torch.float32)
X_train_1 = torch.stack(
    [x[0] * 256 for x in train if x[1] == 1]).type(torch.float32)
X_test = torch.stack(
    [x[0] * 256 for x in test if x[1] == 0 or x[1] == 1]).type(torch.float32)
y_test = torch.stack([torch.tensor(x[1]) for x in test
                      if x[1] == 0 or x[1] == 1]).type(torch.float32)

# Compute averages
ave_0 = torch.mean(X_train_0, axis=0)
ave_1 = torch.mean(X_train_1, axis=0)


It can be informative to examine these averages in detail, so let’s plot what they look like. 
In this case, we see that the average indeed resembles a blurry image of a t-shirt. 


# Plot average t-shirt
d2l.set_figsize()
d2l.plt.imshow(ave_0.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()


In the second case, we again see that the average resembles a blurry image of trousers. 


# Plot average trousers
d2l.plt.imshow(ave_1.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()


In a fully machine learned solution, we would learn the threshold from the dataset. In this 
case, I simply eyeballed a threshold that looked good on the training data by hand. 


# Print test set accuracy with eyeballed threshold
w = (ave_1 - ave_0).T
# '@' is Matrix Multiplication operator in pytorch.
predictions = X_test.reshape(2000, -1) @ (w.flatten()) > -1500000

# Accuracy
torch.mean((predictions.type(y_test.dtype) == y_test).float(), dtype=torch.float64)


tensor(0.7870, dtype=torch.float64)
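

The same mean-difference idea can be seen on a tiny two-dimensional example that needs no dataset download. The sketch below is plain Python; the six points are invented, and, unlike the eyeballed threshold above, it places the threshold at the midpoint between the two class means, which is an assumption made here for illustration rather than the approach taken in the text.

```python
def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy data for two classes.
class0 = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5)]
class1 = [(4.0, 5.0), (5.0, 4.0), (4.5, 4.5)]

m0, m1 = mean(class0), mean(class1)
w = tuple(b - a for a, b in zip(m0, m1))  # normal of the decision plane

# Assumed threshold: the score of the midpoint between the two means.
midpoint = tuple((a + b) / 2 for a, b in zip(m0, m1))
threshold = dot(w, midpoint)

def predict(x):
    return 1 if dot(w, x) > threshold else 0

predictions0 = [predict(p) for p in class0]
predictions1 = [predict(p) for p in class1]

print(w, threshold, predictions0, predictions1)
```

The vector between the means supplies the direction of the hyperplane's normal, and the threshold only decides where along that direction the cut is made.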


A.1.4 Geometry of Linear Transformations 


Through Section 2.3 and the above discussions, we have a solid understanding of the geom- 
etry of vectors, lengths, and angles. However, there is one important object we have omitted 
discussing, and that is a geometric understanding of linear transformations represented by 
matrices. Fully internalizing what matrices can do to transform data between two poten- 
tially different high dimensional spaces takes significant practice, and is beyond the scope 
of this appendix. However, we can start building up intuition in two dimensions. 


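Suppose that we have some matrix:


$$\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}. \tag{A.11}$$


If we want to apply this to an arbitrary vector $\mathbf{v} = [x, y]^\top$, we multiply and see that


$$\begin{aligned} \mathbf{A}\mathbf{v} &= \begin{bmatrix} a & b \\ c & d \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} ax + by \\ cx + dy \end{bmatrix} \\ &= x\begin{bmatrix} a \\ c \end{bmatrix} + y\begin{bmatrix} b \\ d \end{bmatrix} \\ &= x\left\{\mathbf{A}\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right\} + y\left\{\mathbf{A}\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right\}. \end{aligned} \tag{A.12}$$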
This may seem like an odd computation, where something clear became somewhat impenetrable. However, it tells us that we can write the way that a matrix transforms any vector in terms of how it transforms two specific vectors: $[1, 0]^\top$ and $[0, 1]^\top$. This is worth considering for a moment. We have essentially reduced an infinite problem (what happens to any pair of real numbers) to a finite one (what happens to these specific vectors). These vectors are an example of a basis, where we can write any vector in our space as a weighted sum of these basis vectors.


Let's draw what happens when we use the specific matrix


$$\mathbf{A} = \begin{bmatrix} 1 & 2 \\ -1 & 3 \end{bmatrix}. \tag{A.13}$$


If we look at the specific vector $\mathbf{v} = [2, -1]^\top$, we see this is $2 \cdot [1, 0]^\top + (-1) \cdot [0, 1]^\top$, and thus we know that the matrix $\mathbf{A}$ will send this to $2(\mathbf{A}[1, 0]^\top) + (-1)(\mathbf{A}[0, 1]^\top) = 2[1, -1]^\top - [2, 3]^\top = [0, -5]^\top$. If we follow this logic through carefully, say by considering the grid of all integer pairs of points, we see that what happens is that the matrix multiplication can skew, rotate, and scale the grid, but the grid structure must remain as you see in Fig. A.8.


The matrix A acting on the given basis vectors. Notice how the entire grid is transported 
along with it. 
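

The decomposition into basis vectors can be checked by direct computation. The sketch below is plain Python with a small helper `matvec`, which is introduced here for illustration, and uses the matrix with rows [1, 2] and [-1, 3] and the vector [2, -1] from the example.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

A = [[1, 2],
     [-1, 3]]

e1, e2 = [1, 0], [0, 1]  # the standard basis vectors
Ae1 = matvec(A, e1)      # first column of A
Ae2 = matvec(A, e2)      # second column of A

v = [2, -1]
direct = matvec(A, v)

# Rebuild the result from the images of the basis vectors: 2 * Ae1 - 1 * Ae2.
via_basis = [v[0] * a + v[1] * b for a, b in zip(Ae1, Ae2)]

print(Ae1, Ae2, direct, via_basis)
```

Knowing where the two basis vectors land is enough to determine where every other vector lands, which is what carries the whole grid along with the transformation.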


This is the most important intuitive point to internalize about linear transformations rep- 
resented by matrices. Matrices are incapable of distorting some parts of space differently 
than others. All they can do is take the original coordinates on our space and skew, rotate, 
and scale them. 


Some distortions can be severe. For instance the matrix


$$\mathbf{B} = \begin{bmatrix} 2 & -1 \\ 4 & -2 \end{bmatrix} \tag{A.14}$$


compresses the entire two-dimensional plane down to a single line. Identifying and working with such transformations is the topic of a later section, but geometrically we can see that this is fundamentally different from the types of transformations we saw above. For instance, the result from matrix $\mathbf{A}$ can be "bent back" to the original grid. The results from matrix $\mathbf{B}$ cannot, because we will never know where the vector $[1, 2]^\top$ came from: was it $[1, 1]^\top$ or $[0, -1]^\top$?


While this picture was for a 2x2 matrix, nothing prevents us from taking the lessons learned 
into higher dimensions. If we take similar basis vectors like [1, 0, ... ,0] and see where our 
matrix sends them, we can start to get a feeling for how the matrix multiplication distorts 
the entire space in whatever dimension space we are dealing with. 


A.1.5 Linear Dependence 


Consider again the matrix 


$$\mathbf{B} = \begin{bmatrix} 2 & -1 \\ 4 & -2 \end{bmatrix}. \tag{A.15}$$


This compresses the entire plane down to live on the single line $y = 2x$. The question now 
arises: is there some way we can detect this just looking at the matrix itself? The answer is 
that indeed we can. Let $\mathbf{b}_1 = [2, 4]^\top$ and $\mathbf{b}_2 = [-1, -2]^\top$ be the two columns of $\mathbf{B}$. 
Remember that we can write everything transformed by the matrix $\mathbf{B}$ as a weighted sum of 
the columns of the matrix: like $a_1\mathbf{b}_1 + a_2\mathbf{b}_2$. We call this a linear combination. The fact 
that $\mathbf{b}_1 = -2 \cdot \mathbf{b}_2$ means that we can write any linear combination of those two columns 
entirely in terms of, say, $\mathbf{b}_2$ since 


$$a_1\mathbf{b}_1 + a_2\mathbf{b}_2 = -2a_1\mathbf{b}_2 + a_2\mathbf{b}_2 = (a_2 - 2a_1)\mathbf{b}_2. \tag{A.16}$$


This means that one of the columns is, in a sense, redundant because it does not define a 
unique direction in space. This should not surprise us too much since we already saw that 
this matrix collapses the entire plane down into a single line. Moreover, we see that the 
linear dependence $\mathbf{b}_1 = -2 \cdot \mathbf{b}_2$ captures this. To make this more symmetrical between the 
two vectors, we will write this as 


$$\mathbf{b}_1 + 2 \cdot \mathbf{b}_2 = 0. \tag{A.17}$$


In general, we will say that a collection of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ are linearly dependent if there 
exist coefficients $a_1, \ldots, a_k$ not all equal to zero so that 


$$\sum_{i=1}^{k} a_i\mathbf{v}_i = 0. \tag{A.18}$$


In this case, we can solve for one of the vectors in terms of some combination of the others, 
and effectively render it redundant. Thus, a linear dependence in the columns of a matrix 
is a witness to the fact that our matrix is compressing the space down to some lower di- 
mension. If there is no linear dependence we say the vectors are linearly independent. If 
the columns of a matrix are linearly independent, no compression occurs and the operation 
can be undone. 
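

As a quick sanity check of the definition, the following sketch (plain Python; the helper name `lin_comb` is ours) takes the two columns $[2, 4]^\top$ and $[-1, -2]^\top$ and confirms that the coefficients $(1, 2)$, which are not all zero, produce the zero vector. For two vectors in the plane there is also a shortcut that we use here as an assumption-free cross-check: they are dependent exactly when the two-dimensional cross product $x_1 y_2 - y_1 x_2$ vanishes.

```python
b1 = [2, 4]
b2 = [-1, -2]


def lin_comb(coeffs, vectors):
    """Return sum_i coeffs[i] * vectors[i] as a list of coordinates."""
    dim = len(vectors[0])
    return [sum(c * vec[d] for c, vec in zip(coeffs, vectors)) for d in range(dim)]


# Coefficients (1, 2) are not all zero, yet the combination vanishes
dependence = lin_comb([1, 2], [b1, b2])

# 2D cross product: zero exactly when the two planar vectors are dependent
cross = b1[0] * b2[1] - b1[1] * b2[0]

print(dependence, cross)
```

Replacing `b2` by a vector that does not lie on the line through `b1`, for example `[0, 1]`, makes the cross product non-zero, and no non-trivial combination will vanish.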


A.1.6 Rank 


If we have a general n x m matrix, it is reasonable to ask what dimension space the matrix 
maps into. A concept known as the rank will be our answer. In the previous section, we 
noted that a linear dependence bears witness to compression of space into a lower dimension 
and so we will be able to use this to define the notion of rank. In particular, the rank of 
a matrix A is the largest number of linearly independent columns amongst all subsets of 
columns. For example, the matrix 


$$\mathbf{B} = \begin{bmatrix} 2 & -1 \\ 4 & -2 \end{bmatrix} \tag{A.19}$$


has $\mathrm{rank}(\mathbf{B}) = 1$, since the two columns are linearly dependent, but either column by itself 
is linearly independent. For a more challenging example, we can consider 


$$\mathbf{C} = \begin{bmatrix} 1 & 3 & 0 & -1 & 0 \\ -1 & 0 & 1 & 1 & -1 \\ 0 & 3 & 1 & 0 & -1 \\ 2 & 3 & -1 & -2 & 1 \end{bmatrix} \tag{A.20}$$


and show that $\mathbf{C}$ has rank two since, for instance, the first two columns are linearly 
independent, whereas every collection of three columns is dependent. 


This procedure, as described, is very inefficient. It requires looking at every subset of the 
columns of our given matrix, and thus is potentially exponential in the number of columns. 
Later we will see a more computationally efficient way to compute the rank of a matrix, 
but for now, this is sufficient to see that the concept is well defined and understand the 
meaning. 
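

For readers who want to experiment now, here is a small sketch of one standard alternative to checking subsets: row reduction (Gaussian elimination), where the rank is the number of pivots found. This is our own illustration and not necessarily the method the later sections have in mind. The function name `rank` and the use of exact fractions are our choices, and the $4 \times 5$ matrix `C` below is built so that its third and fourth rows are the sum and the difference of the first two.

```python
from fractions import Fraction


def rank(matrix):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    r = 0  # number of pivots found so far
    for col in range(n_cols):
        # Look for a row at or below position r with a non-zero entry in this column
        pivot = next((i for i in range(r, n_rows) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Clear the column below the pivot
        for i in range(r + 1, n_rows):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == n_rows:
            break
    return r


B = [[2, -1],
     [4, -2]]
C = [[1, 3, 0, -1, 0],
     [-1, 0, 1, 1, -1],
     [0, 3, 1, 0, -1],    # row 1 + row 2
     [2, 3, -1, -2, 1]]   # row 1 - row 2
print(rank(B), rank(C))
```

The work grows polynomially with the matrix size, in contrast to the exponential number of column subsets. In floating point one would normally prefer a library routine with a tolerance, since exact zero tests are fragile.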


A.1.7 Invertibility 


We have seen above that multiplication by a matrix with linearly dependent columns cannot 
be undone, i.e., there is no inverse operation that can always recover the input. However, 
multiplication by a full-rank matrix (i.e., some $n \times n$ matrix $\mathbf{A}$ with rank $n$) 
can always be undone. Consider the matrix 


$$\mathbf{I} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}, \tag{A.21}$$


which is the matrix with ones along the diagonal, and zeros elsewhere. We call this the 
identity matrix. It is the matrix which leaves our data unchanged when applied. To find 
a matrix which undoes what our matrix $\mathbf{A}$ has done, we want to find a matrix $\mathbf{A}^{-1}$ such 
that 


$$\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}. \tag{A.22}$$


If we look at this as a system, we have $n \times n$ unknowns (the entries of $\mathbf{A}^{-1}$) and $n \times n$ 
equations (the equality that needs to hold between every entry of the product $\mathbf{A}^{-1}\mathbf{A}$ and 
every entry of $\mathbf{I}$) so we should generically expect a solution to exist. Indeed, in the next 
section we will see a quantity called the determinant, which has the property that as long as 
the determinant is not zero, we can find a solution. We call such a matrix $\mathbf{A}^{-1}$ the inverse 
matrix. As an example, if $\mathbf{A}$ is the general $2 \times 2$ matrix 


$$\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \tag{A.23}$$


then we can see that the inverse is 


$$\frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}. \tag{A.24}$$


We can test this by checking that multiplying by the inverse given by the formula above 
works in practice. 


M = torch.tensor([[1, 2], [1, 4]], dtype=torch.float32) 
M_inv = torch.tensor([[2, -1], [-0.5, 0.5]]) 
M_inv @ M 


tensor([[1., 0.], 
        [0., 1.]]) 
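

The hard-coded inverse can also be derived rather than typed in. The sketch below, in plain Python with exact fractions, applies the closed form $\frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$ for a $2 \times 2$ matrix and multiplies the result back against the original. The helper names `inverse_2x2` and `matmul` are ours.

```python
from fractions import Fraction


def inverse_2x2(M):
    """Inverse of [[a, b], [c, d]] using (1 / (ad - bc)) * [[d, -b], [-c, a]]."""
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    if det == 0:
        raise ValueError("matrix is not invertible (ad - bc = 0)")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


M = [[1, 2], [1, 4]]
M_inv = inverse_2x2(M)
print(M_inv)
print(matmul(M_inv, M))
```

The entries of `M_inv` come out as 2, -1, -1/2 and 1/2, matching the values used above, and the product is the identity.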


Numerical Issues 


While the inverse of a matrix is useful in theory, we must say that most of the time we do 
not wish to use the matrix inverse to solve a problem in practice. In general, there are far 
more numerically stable algorithms for solving linear equations like 


$$\mathbf{A}\mathbf{x} = \mathbf{b}, \tag{A.25}$$
than computing the inverse and multiplying to get 
$$\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}. \tag{A.26}$$


Just as division by a small number can lead to numerical instability, so can inversion of a 
matrix which is close to having low rank. 
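

A tiny experiment hints at why. In the sketch below (plain Python; the matrix family and the helper names are our own illustrative choices), the two columns of the matrix become nearly identical as `eps` shrinks, so the matrix is close to having rank one. The entries of its inverse grow roughly like `1 / eps`, which means small errors in the input get magnified by about that factor.

```python
def inverse_2x2(M):
    """Closed-form inverse of a 2x2 matrix in floating point."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def largest_inverse_entry(eps):
    """Largest absolute entry of the inverse of [[1, 1], [1, 1 + eps]]."""
    M = [[1.0, 1.0], [1.0, 1.0 + eps]]
    return max(abs(x) for row in inverse_2x2(M) for x in row)


for eps in (1e-1, 1e-4, 1e-8):
    print(f"eps = {eps:g}: largest |entry| of the inverse is about "
          f"{largest_inverse_entry(eps):.3g}")
```

This is only a toy illustration of the phenomenon, not an analysis of any particular solver.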


Moreover, it is common that the matrix $\mathbf{A}$ is sparse, which is to say that it contains only a 
small number of non-zero values. If we were to explore examples, we would see that this 
does not mean the inverse is sparse. Even if $\mathbf{A}$ was a 1 million by 1 million matrix with only 
5 million non-zero entries (and thus we need only store those 5 million), the inverse will 
typically have almost every entry non-zero, requiring us to store all $1\text{M}^2$ entries, that 
is, 1 trillion entries! 


While we do not have time to dive all the way into the thorny numerical issues frequently 
encountered when working with linear algebra, we want to provide you with some intuition 
about when to proceed with caution, and generally avoiding inversion in practice is a good 
rule of thumb. 


A.1.8 Determinant 


The geometric view of linear algebra gives an intuitive way to interpret a fundamental 
quantity known as the determinant. Consider the grid image from before, but now with a 
highlighted region (Fig. A.9). 


The matrix $\mathbf{A}$ again distorting the grid. This time, I want to draw particular attention to 
what happens to the highlighted square. 


Look at the highlighted square. This is a square with edges given by $(0, 1)$ and $(1, 0)$ and 
thus it has area one. After $\mathbf{A}$ transforms this square, we see that it becomes a parallelogram. 
There is no reason this parallelogram should have the same area that we started with, and 
indeed in the specific case shown here of 


$$\mathbf{A} = \begin{bmatrix} 1 & 2 \\ -1 & 3 \end{bmatrix}, \tag{A.27}$$


it is an exercise in coordinate geometry to compute the area of this parallelogram and obtain 
that the area is 5. 


In general, if we have a matrix 


$$\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \tag{A.28}$$


we can see with some computation that the area of the resulting parallelogram is $ad - bc$. 
This area is referred to as the determinant. 


Let’s check this quickly with some example code. 


torch.det(torch.tensor([[1, -1], [2, 3]], dtype=torch.float32)) 


tensor(5.) 
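

The "exercise in coordinate geometry" can also be carried out numerically without using the formula $ad - bc$ at all. The sketch below (plain Python; function names are ours) maps the corners of the unit square through the matrix with columns $[1, -1]^\top$ and $[2, 3]^\top$ and measures the resulting parallelogram with the shoelace formula, which returns a signed area that is positive for counterclockwise corner order. Swapping the two columns flips the orientation and negates the result.

```python
def shoelace_area(vertices):
    """Signed area of a polygon; positive when the vertices run counterclockwise."""
    total = 0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return total / 2


def image_of_unit_square(M):
    """Corners (0,0), (1,0), (1,1), (0,1) after applying the 2x2 matrix M."""
    (a, b), (c, d) = M
    return [(0, 0), (a, c), (a + b, c + d), (b, d)]


A = [[1, 2], [-1, 3]]
area = shoelace_area(image_of_unit_square(A))

# Same columns in the opposite order: the square is flipped over
flipped = shoelace_area(image_of_unit_square([[2, 1], [3, -1]]))
print(area, flipped)
```

The two values agree in magnitude and differ in sign, which is the convention for negative determinants.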


The eagle-eyed amongst us will notice that this expression can be zero or even negative. 
For the negative term, this is a matter of convention taken generally in mathematics: if the 
matrix flips the figure, we say the area is negated. Let’s see now that when the determinant 
is zero, we learn more. 


Let’s consider 


$$\mathbf{B} = \begin{bmatrix} 2 & 4 \\ -1 & -2 \end{bmatrix}. \tag{A.29}$$


If we compute the determinant of this matrix, we get $2 \cdot (-2) - 4 \cdot (-1) = 0$. Given our 
understanding above, this makes sense. $\mathbf{B}$ compresses the square from the original image 
down to a line segment, which has zero area. And indeed, being compressed into a lower 
dimensional space is the only way to have zero area after the transformation. Thus we see 
the following result is true: a matrix $\mathbf{A}$ is invertible if and only if the determinant is not 
equal to zero. 


As a final comment, imagine that we have any figure drawn on the plane. Thinking like 
computer scientists, we can decompose that figure into a collection of little squares so that 
the area of the figure is in essence just the number of squares in the decomposition. If we 
now transform that figure by a matrix, we send each of these squares to parallelograms, each 
one of which has area given by the determinant. We see that for any figure, the determinant 
gives the (signed) factor by which a matrix scales the area of that figure. 


Computing determinants for larger matrices can be laborious, but the intuition is the same. 
The determinant remains the factor by which $n \times n$ matrices scale $n$-dimensional volumes. 


A.1.9 Tensors and Common Linear Algebra Operations 


In Section 2.3 the concept of tensors was introduced. In this section, we will dive more 
deeply into tensor contractions (the tensor equivalent of matrix multiplication), and see 
how it can provide a unified view on a number of matrix and vector operations. 


With matrices and vectors we knew how to multiply them to transform data. We need 
to have a similar definition for tensors if they are to be useful to us. Think about matrix 
multiplication: 


$$\mathbf{C} = \mathbf{A}\mathbf{B}, \tag{A.30}$$


or equivalently 


$$c_{i,j} = \sum_{k} a_{i,k}b_{k,j}. \tag{A.31}$$


This pattern is one we can repeat for tensors. For tensors, there is no one case of what to 
sum over that can be universally chosen, so we need to specify exactly which indices we want 
to sum over. For instance, we could consider 


$$y_{il} = \sum_{jk} x_{ijkl}a_{jk}. \tag{A.32}$$


Such a transformation is called a tensor contraction. It can represent a far more flexible 
family of transformations than matrix multiplication alone. 


As an often-used notational simplification, we can notice that the sum is over exactly those 
indices that occur more than once in the expression, thus people often work with Einstein 
notation, where the summation is implicitly taken over all repeated indices. This gives the 
compact expression: 


$$y_{il} = x_{ijkl}a_{jk}. \tag{A.33}$$
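

To see what the implicit summation means operationally, here is a sketch that spells out $y_{il} = x_{ijkl}a_{jk}$ with explicit loops over nested Python lists. The function name `contract` and the small made-up tensors are our own. The free indices $i$ and $l$ label the output, while the repeated indices $j$ and $k$ are summed away.

```python
def contract(x, a):
    """Compute y[i][l] = sum over j and k of x[i][j][k][l] * a[j][k]."""
    I, J, K, L = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    y = [[0] * L for _ in range(I)]
    for i in range(I):
        for l in range(L):
            # j and k appear twice in the expression, so they are summed over
            y[i][l] = sum(x[i][j][k][l] * a[j][k]
                          for j in range(J) for k in range(K))
    return y


# Made-up example: x has shape (2, 2, 2, 2) with entries i + j + k + l
x = [[[[i + j + k + l for l in range(2)] for k in range(2)]
      for j in range(2)] for i in range(2)]
a = [[1, 0], [0, 1]]  # identity, so only the j == k terms survive

y = contract(x, a)
print(y)
```

With the identity as `a`, each output entry reduces to `x[i][0][0][l] + x[i][1][1][l]`, which makes the result easy to verify by hand.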


Common Examples from Linear Algebra 


Let’s see how many of the linear algebraic definitions we have seen before can be expressed 
in this compressed tensor notation: 


• $\mathbf{v} \cdot \mathbf{w} = \sum_i v_iw_i$ 
• $\|\mathbf{v}\|_2^2 = \sum_i v_iv_i$ 
• $(\mathbf{A}\mathbf{v})_i = \sum_j a_{ij}v_j$ 
• $(\mathbf{A}\mathbf{B})_{ik} = \sum_j a_{ij}b_{jk}$ 
• $\mathrm{tr}(\mathbf{A}) = \sum_i a_{ii}$ 


In this way, we can replace a myriad of specialized notations with short tensor expressions. 


Expressing in Code 


Tensors may flexibly be operated on in code as well. As seen in Section 2.3, we can create 
tensors as is shown below. 


# Define tensors 
B = torch.tensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]) 
A = torch.tensor([[1, 2], [3, 4]]) 
v = torch.tensor([1, 2]) 

# Print out the shapes 
A.shape, B.shape, v.shape 


(torch.Size([2, 2]), torch.Size([2, 2, 3]), torch.Size([2])) 


Einstein summation has been implemented directly. The indices that occur in the Einstein 
summation can be passed as a string, followed by the tensors that are being acted upon. For 
instance, to implement matrix multiplication, we can consider the Einstein summation seen 
above ($(\mathbf{A}\mathbf{v})_i = a_{ij}v_j$) and strip out the indices themselves to get the implementation: 


# Reimplement matrix multiplication 
torch.einsum("ij, j -> i", A, v), A@v 


(tensor([ 5, 11]), tensor([ 5, 11])) 


This is a highly flexible notation. For instance, if we want to compute what would be 
traditionally written as 


$$c_{kl} = \sum_{ij} b_{ijk}a_{il}v_j, \tag{A.34}$$


it can be implemented via Einstein summation as: 


torch.einsum("ijk, il, j -> kl", B, A, v) 


tensor([[ 90, 126], 
        [102, 144], 
        [114, 162]]) 


This notation is readable and efficient for humans, but bulky if for whatever reason we 
need to generate a tensor contraction programmatically. For this reason, einsum provides 
an alternative notation by providing integer indices for each tensor. For example, the same 
tensor contraction can also be written as: 


# PyTorch does not support this type of notation. 


Either notation allows for concise and efficient representation of tensor contractions in 
code. 


A.1.10 Summary 
• Vectors can be interpreted geometrically as either points or directions in space. 

• Dot products extend the notion of angle to arbitrarily high-dimensional spaces. 

• Hyperplanes are high-dimensional generalizations of lines and planes. They can be used 
to define decision planes that are often used as the last step in a classification task. 

• Matrix multiplication can be geometrically interpreted as uniform distortions of the 
underlying coordinates. They represent a very restricted, but mathematically clean, way 
to transform vectors. 

• Linear dependence is a way to tell when a collection of vectors are in a lower dimensional 
space than we would expect (say you have 3 vectors living in a 2-dimensional space). 
The rank of a matrix is the size of the largest subset of its columns that are linearly 
independent. 

• When a matrix’s inverse is defined, matrix inversion allows us to find another matrix that 
undoes the action of the first. Matrix inversion is useful in theory, but requires care in 
practice owing to numerical instability. 

• Determinants allow us to measure how much a matrix expands or contracts a space. A 
nonzero determinant implies an invertible (non-singular) matrix and a zero-valued 
determinant means that the matrix is non-invertible (singular). 

• Tensor contractions and Einstein summation provide for a neat and clean notation for 
expressing many of the computations that are seen in machine learning. 


A.1.11 Exercises 


1. What is the angle between 


$$\mathbf{v}_1 = \begin{bmatrix} 1 \\ 0 \\ -1 \\ 2 \end{bmatrix}, \qquad \mathbf{v}_2 = \begin{bmatrix} 3 \\ 1 \\ 0 \\ 1 \end{bmatrix}? \tag{A.35}$$


2. True or false: $\begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}$ and $\begin{bmatrix} 1 & -2 \\ 0 & 1 \end{bmatrix}$ are inverses of one another? 


3. Suppose that we draw a shape in the plane with area $100\,\mathrm{m}^2$. What is the area after 
transforming the figure by the matrix 


Fi 4 . (A.36) 


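4. Which of the following sets of vectors are linearly independent? 

• $\left\{\begin{pmatrix} 1 \\ 0 \\ -1 \end{pmatrix}, \begin{pmatrix} 2 \\ 1 \\ -1 \end{pmatrix}, \begin{pmatrix} 3 \\ 1 \\ 1 \end{pmatrix}\right\}$ 

• $\left\{\begin{pmatrix} 3 \\ 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}\right\}$ 

• $\left\{\begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ -1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}\right\}$ 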


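5. Suppose that you have a matrix written as $\mathbf{A} = \begin{bmatrix} c \\ d \end{bmatrix} \cdot \begin{bmatrix} a & b \end{bmatrix}$ for some choice of values 
$a$, $b$, $c$, and $d$. True or false: the determinant of such a matrix is always 0? 


6. The vectors $\mathbf{e}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and $\mathbf{e}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$ are orthogonal. What is the condition on a matrix $\mathbf{A}$ 
so that $\mathbf{A}\mathbf{e}_1$ and $\mathbf{A}\mathbf{e}_2$ are orthogonal? 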


7. How can you write $\mathrm{tr}(\mathbf{A}^4)$ in Einstein notation for an arbitrary matrix $\mathbf{A}$? 




A.2 Eigendecompositions 


Eigenvalues are often one of the most useful notions we will encounter when studying linear 
algebra. However, as a beginner, it is easy to overlook their importance. Below, we introduce 
eigendecomposition and try to convey some sense of just why it is so important. 


Suppose that we have a matrix $\mathbf{A}$ with the following entries: 


$$\mathbf{A} = \begin{bmatrix} 2 & 0 \\ 0 & -1 \end{bmatrix}. \tag{A.1}$$


If we apply $\mathbf{A}$ to any vector $\mathbf{v} = [x, y]^\top$, we obtain a vector $\mathbf{A}\mathbf{v} = [2x, -y]^\top$. This has an 
intuitive interpretation: stretch the vector to be twice as wide in the $x$-direction, and then 
flip it in the $y$-direction. 


However, there are some vectors for which something remains unchanged. Namely $[1, 0]^\top$ 
gets sent to $[2, 0]^\top$ and $[0, 1]^\top$ gets sent to $[0, -1]^\top$. These vectors are still in the same 
line, and the only modification is that the matrix stretches them by a factor of $2$ and $-1$ 
respectively. We call such vectors eigenvectors and the factor they are stretched by 
eigenvalues. 


In general, if we can find a number $\lambda$ and a vector $\mathbf{v}$ such that 


$$\mathbf{A}\mathbf{v} = \lambda\mathbf{v}, \tag{A.2}$$


we say that $\mathbf{v}$ is an eigenvector for $\mathbf{A}$ and $\lambda$ is an eigenvalue. 


A.2.1 Finding Eigenvalues 


Let’s figure out how to find them. By subtracting off the $\lambda\mathbf{v}$ from both sides, and then 
factoring out the vector, we see the above is equivalent to: 


$$(\mathbf{A} - \lambda\mathbf{I})\mathbf{v} = 0. \tag{A.3}$$


For (A.3) to happen, we see that $(\mathbf{A} - \lambda\mathbf{I})$ must compress some direction down to zero, 
hence it is not invertible, and thus the determinant is zero. Thus, we can find the eigenvalues 
by finding for what $\lambda$ we have $\det(\mathbf{A} - \lambda\mathbf{I}) = 0$. Once we find the eigenvalues, we can solve 
$\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$ to find the associated eigenvector(s). 


An Example 


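Let’s see this with a more challenging matrix 


$$\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 2 & 3 \end{bmatrix}. \tag{A.4}$$


If we consider $\det(\mathbf{A} - \lambda\mathbf{I}) = 0$, we see this is equivalent to the polynomial equation 
$0 = (2 - \lambda)(3 - \lambda) - 2 = (4 - \lambda)(1 - \lambda)$. Thus, two eigenvalues are $4$ and $1$. To find the 
associated vectors, we then need to solve 


$$\begin{bmatrix} 2 & 1 \\ 2 & 3 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x \\ y \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 2 & 1 \\ 2 & 3 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 4x \\ 4y \end{bmatrix}. \tag{A.5}$$


We can solve this with the vectors $[1, -1]^\top$ and $[1, 2]^\top$ respectively. 


We can check this in code using the built-in torch.linalg.eig routine. 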


%matplotlib inline 
import torch 
from IPython import display 
from d2l import torch as d2l 

torch.linalg.eig(torch.tensor([[2, 1], [2, 3]], dtype=torch.float64)) 


torch.return_types.linalg_eig( 
eigenvalues=tensor([1.+0.j, 4.+0.j], dtype=torch.complex128), 
eigenvectors=tensor([[-0.7071+0.j, -0.4472+0.j], 
        [ 0.7071+0.j, -0.8944+0.j]], dtype=torch.complex128)) 


Note that torch.linalg.eig normalizes the eigenvectors to be of length one, whereas we took ours to 
be of arbitrary length. Additionally, the choice of sign is arbitrary. However, the vectors 
computed are parallel to the ones we found by hand with the same eigenvalues. 
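

The hand computation can be mirrored in a few lines of plain Python. For a $2 \times 2$ matrix, $\det(\mathbf{A} - \lambda\mathbf{I})$ expands to $\lambda^2 - \mathrm{tr}(\mathbf{A})\lambda + \det(\mathbf{A})$, so the eigenvalues follow from the quadratic formula. The sketch below (helper names are ours, and it only handles the real-eigenvalue case) recovers 1 and 4 for the matrix with rows $[2, 1]$ and $[2, 3]$ and confirms that $[1, -1]^\top$ and $[1, 2]^\top$ are scaled by those factors.

```python
import math


def eigenvalues_2x2(M):
    """Real eigenvalues of a 2x2 matrix from det(M - lambda * I) = 0."""
    (a, b), (c, d) = M
    trace, det = a + d, a * d - b * c
    # Characteristic polynomial: lambda^2 - trace * lambda + det
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("eigenvalues are complex; this sketch covers the real case only")
    root = math.sqrt(disc)
    return (trace - root) / 2, (trace + root) / 2


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


A = [[2, 1], [2, 3]]
lams = eigenvalues_2x2(A)
print(lams)
print(matvec(A, [1, -1]), matvec(A, [1, 2]))
```

The first vector comes back unchanged (eigenvalue 1) and the second comes back multiplied by 4.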


A.2.2 Decomposing Matrices 


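Let’s continue the previous example one step further. Let 


$$\mathbf{W} = \begin{bmatrix} 1 & 1 \\ -1 & 2 \end{bmatrix} \tag{A.6}$$


be the matrix where the columns are the eigenvectors of the matrix $\mathbf{A}$. Let 


$$\boldsymbol{\Sigma} = \begin{bmatrix} 1 & 0 \\ 0 & 4 \end{bmatrix} \tag{A.7}$$


be the matrix with the associated eigenvalues on the diagonal. Then the definition of 
eigenvalues and eigenvectors tells us that 


$$\mathbf{A}\mathbf{W} = \mathbf{W}\boldsymbol{\Sigma}. \tag{A.8}$$


The matrix $\mathbf{W}$ is invertible, so we may multiply both sides by $\mathbf{W}^{-1}$ on the right, and we see that 
we may write 


$$\mathbf{A} = \mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^{-1}. \tag{A.9}$$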


In the next section we will see some nice consequences of this, but for now we need only 
know that such a decomposition will exist as long as we can find a full collection of linearly 
independent eigenvectors (so that W is invertible). 
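

As a check on the running example, the sketch below rebuilds the matrix with rows $[2, 1]$ and $[2, 3]$ from its eigenvectors $[1, -1]^\top$, $[1, 2]^\top$ and eigenvalues $1$, $4$. It is plain Python with exact fractions. The variable names (`W`, `Sigma`, `W_inv`) and the helper `matmul` are ours, and the inverse of `W` is obtained from the $2 \times 2$ closed form.

```python
from fractions import Fraction


def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


A = [[2, 1], [2, 3]]
W = [[1, 1], [-1, 2]]      # columns are the eigenvectors [1, -1] and [1, 2]
Sigma = [[1, 0], [0, 4]]   # matching eigenvalues on the diagonal

# Inverse of W from the 2x2 closed form, kept exact with fractions
det_W = Fraction(W[0][0] * W[1][1] - W[0][1] * W[1][0])
W_inv = [[W[1][1] / det_W, -W[0][1] / det_W],
         [-W[1][0] / det_W, W[0][0] / det_W]]

lhs = matmul(A, W)                         # A W
rhs = matmul(W, Sigma)                     # W Sigma
rebuilt = matmul(matmul(W, Sigma), W_inv)  # W Sigma W^{-1}
print(lhs == rhs, rebuilt == A)
```

Both comparisons hold exactly, since no floating-point rounding is involved.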


A.2.3 Operations on Eigendecompositions 


One nice thing about eigendecompositions (A.9) is that we can write many operations we 
usually encounter cleanly in terms of the eigendecomposition. As a first example, con- 
sider: 


$$\mathbf{A}^n = \overbrace{\mathbf{A} \cdots \mathbf{A}}^{n \text{ times}} = \overbrace{(\mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^{-1}) \cdots (\mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^{-1})}^{n \text{ times}} = \mathbf{W}\overbrace{\boldsymbol{\Sigma} \cdots \boldsymbol{\Sigma}}^{n \text{ times}}\mathbf{W}^{-1} = \mathbf{W}\boldsymbol{\Sigma}^n\mathbf{W}^{-1}. \tag{A.10}$$


This tells us that for any positive power of a matrix, the eigendecomposition is obtained by 
just raising the eigenvalues to the same power. The same can be shown for negative powers, 
so if we want to invert a matrix we need only consider 


$$\mathbf{A}^{-1} = \mathbf{W}\boldsymbol{\Sigma}^{-1}\mathbf{W}^{-1}, \tag{A.11}$$


or in other words, just invert each eigenvalue. This will work as long as each eigenvalue is 
non-zero, so we see that being invertible is the same as having no zero eigenvalues. 


Indeed, additional work can show that if $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of a matrix, then 
the determinant of that matrix is 


$$\det(\mathbf{A}) = \lambda_1 \cdots \lambda_n, \tag{A.12}$$


or the product of all the eigenvalues. This makes sense intuitively because whatever 
stretching $\mathbf{W}$ does, $\mathbf{W}^{-1}$ undoes it, so in the end the only stretching that happens is by 
multiplication by the diagonal matrix $\boldsymbol{\Sigma}$, which stretches volumes by the product of the diagonal 
elements. 


Finally, recall that the rank was the maximum number of linearly independent columns of 
your matrix. By examining the eigendecomposition closely, we can see that the rank is the 
same as the number of non-zero eigenvalues of A. 
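

These claims are easy to test on the running example with eigenvalues $1$ and $4$. The sketch below (plain Python with exact fractions; helper names and the choice $n = 5$ are ours) computes a matrix power once by repeated multiplication and once by raising only the eigenvalues to that power, and also compares the determinant with the product of the eigenvalues.

```python
from fractions import Fraction


def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def matrix_power(M, n):
    """Multiply M by itself so that the result is M^n (n >= 1)."""
    result = M
    for _ in range(n - 1):
        result = matmul(result, M)
    return result


A = [[2, 1], [2, 3]]
W = [[1, 1], [-1, 2]]                       # eigenvectors as columns
W_inv = [[Fraction(2, 3), Fraction(-1, 3)],
         [Fraction(1, 3), Fraction(1, 3)]]  # inverse of W
eigenvalues = [1, 4]

n = 5
Sigma_n = [[eigenvalues[0] ** n, 0], [0, eigenvalues[1] ** n]]
via_eig = matmul(matmul(W, Sigma_n), W_inv)  # W Sigma^n W^{-1}
direct = matrix_power(A, n)                  # A multiplied out n times

det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]
print(via_eig == direct, det_A == eigenvalues[0] * eigenvalues[1])
```

Only two scalar powers are needed on the eigendecomposition route, however large $n$ is.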


The examples could continue, but hopefully the point is clear: eigendecomposition can 
simplify many linear-algebraic computations and is a fundamental operation underlying 
many numerical algorithms and much of the analysis that we do in linear algebra. 


A.2.4 Eigendecompositions of Symmetric Matrices 


It is not always possible to find enough linearly independent eigenvectors for the above 
process to work. For instance, the matrix 


$$\mathbf{A} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \tag{A.13}$$


has only a single eigenvector, namely $(1, 0)^\top$. To handle such matrices, we require more 
advanced techniques than we can cover (such as the Jordan Normal Form, or Singular Value 
Decomposition). We will often need to restrict our attention to those matrices where we 
can guarantee the existence of a full set of eigenvectors. 


The most commonly encountered family are the symmetric matrices, which are those 
matrices where $\mathbf{A} = \mathbf{A}^\top$. In this case, we may take $\mathbf{W}$ to be an orthogonal matrix (a matrix 
whose columns are all length one vectors that are at right angles to one another, where 
$\mathbf{W}^\top = \mathbf{W}^{-1}$) and all the eigenvalues will be real. Thus, in this special case, we can write 
(A.9) as 


$$\mathbf{A} = \mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^\top. \tag{A.14}$$


A.2.5 Gershgorin Circle Theorem 


Eigenvalues are often difficult to reason with intuitively. If presented an arbitrary matrix, 
there is little that can be said about what the eigenvalues are without computing them. There 
is, however, one theorem that can make it easy to approximate them well if the largest values are 
on the diagonal. 


Let $\mathbf{A} = (a_{ij})$ be any square matrix ($n \times n$). We will define $r_i = \sum_{j \neq i} |a_{ij}|$. Let $\mathcal{D}_i$ 
represent the disc in the complex plane with center $a_{ii}$ and radius $r_i$. Then, every eigenvalue of 
$\mathbf{A}$ is contained in one of the $\mathcal{D}_i$. 


This can be a bit to unpack, so let’s look at an example. Consider the matrix: 


$$\mathbf{A} = \begin{bmatrix} 1.0 & 0.1 & 0.1 & 0.1 \\ 0.1 & 3.0 & 0.2 & 0.3 \\ 0.1 & 0.2 & 5.0 & 0.5 \\ 0.1 & 0.3 & 0.5 & 9.0 \end{bmatrix}. \tag{A.15}$$


We have $r_1 = 0.3$, $r_2 = 0.6$, $r_3 = 0.8$ and $r_4 = 0.9$. The matrix is symmetric, so all 
eigenvalues are real. This means that all of our eigenvalues will be in one of the ranges 
of 


$$[a_{11} - r_1, a_{11} + r_1] = [0.7, 1.3], \tag{A.16}$$
$$[a_{22} - r_2, a_{22} + r_2] = [2.4, 3.6], \tag{A.17}$$
$$[a_{33} - r_3, a_{33} + r_3] = [4.2, 5.8], \tag{A.18}$$
$$[a_{44} - r_4, a_{44} + r_4] = [8.1, 9.9]. \tag{A.19}$$


Performing the numerical computation shows that the eigenvalues are approximately 0.99, 
2.97, 4.95, 9.08, all comfortably inside the ranges provided. 


A = torch.tensor([[1.0, 0.1, 0.1, 0.1], 
                  [0.1, 3.0, 0.2, 0.3], 
                  [0.1, 0.2, 5.0, 0.5], 
                  [0.1, 0.3, 0.5, 9.0]]) 
v, _ = torch.linalg.eig(A) 
v 


tensor([0.9923+0.j, 9.0803+0.j, 4.9539+0.j, 2.9734+0.j]) 


In this way, eigenvalues can be approximated, and the approximations will be fairly accurate 
in the case that the diagonal is significantly larger than all the other elements. 
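

The interval bookkeeping is mechanical enough to automate. The sketch below (plain Python; the function name `gershgorin_intervals` is ours) computes, for each row, the diagonal entry plus or minus the sum of the absolute off-diagonal entries. For a real symmetric matrix the discs can be read as intervals on the real line, which is the case assumed here. The eigenvalue list is copied from the numerical output reported above.

```python
def gershgorin_intervals(M):
    """Return (center - radius, center + radius) for each row of a square matrix."""
    intervals = []
    for i, row in enumerate(M):
        # Radius: sum of absolute values of the off-diagonal entries in row i
        radius = sum(abs(x) for j, x in enumerate(row) if j != i)
        intervals.append((row[i] - radius, row[i] + radius))
    return intervals


A = [[1.0, 0.1, 0.1, 0.1],
     [0.1, 3.0, 0.2, 0.3],
     [0.1, 0.2, 5.0, 0.5],
     [0.1, 0.3, 0.5, 9.0]]
intervals = gershgorin_intervals(A)

# Eigenvalues as reported by the numerical computation in the text
reported = [0.9923, 2.9734, 4.9539, 9.0803]
for (low, high), lam in zip(intervals, reported):
    print(f"[{low:.1f}, {high:.1f}] contains {lam}: {low <= lam <= high}")
```

Because the four intervals here do not overlap, each one isolates exactly one eigenvalue. With overlapping discs the theorem only guarantees that every eigenvalue lies somewhere in the union.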


It is a small thing, but with a complex and subtle topic like eigendecomposition, it is good 
to get any intuitive grasp we can. 


A.2.6 A Useful Application: The Growth of Iterated Maps 


Now that we understand what eigenvectors are in principle, let’s see how they can be used 
to provide a deep understanding of a problem central to neural network behavior: proper 
weight initialization. 


Eigenvectors as Long Term Behavior 


The full mathematical investigation of the initialization of deep neural networks is beyond 
the scope of the text, but we can see a toy version here to understand how eigenvalues can 
help us see how these models work. As we know, neural networks operate by interspersing 
layers of linear transformations with non-linear operations. For simplicity here, we will 
assume that there is no non-linearity, and that the transformation is a single repeated matrix 
operation A, so that the output of our model is 


$$\mathbf{v}_{out} = \mathbf{A} \cdot \mathbf{A} \cdots \mathbf{A}\mathbf{v}_{in} = \mathbf{A}^N\mathbf{v}_{in}. \tag{A.20}$$


When these models are initialized, $\mathbf{A}$ is taken to be a random matrix with Gaussian entries, 
so let’s make one of those. To be concrete, we start with a mean zero, variance one Gaussian 
distributed $5 \times 5$ matrix. 


torch.manual_seed(42) 

k = 5 
A = torch.randn(k, k, dtype=torch.float64) 
A 


tensor([[ 0.2996,  0.2424,  0.2832, -0.2329,  0.6712], 
        [ 0.7818, -1.7903, -1.7484,  0.1735, -0.1182], 
        [-1.7446, -0.4695,  0.4573,  0.5177, -0.2771], 
        [-0.6641,  0.6551,  0.2616, -1.5265, -0.3311], 
        [-0.6378,  0.1072,  0.7096,  0.3009, -0.2869]], dtype=torch.float64) 


Behavior on Random Data 


For simplicity in our toy model, we will assume that the data vector we feed in, $\mathbf{v}_{in}$, is a 
random five dimensional Gaussian vector. Let’s think about what we want to have happen. 
For context, let’s think of a generic ML problem, where we are trying to turn input data, like 
an image, into a prediction, like the probability the image is a picture of a cat. If repeated 
application of $\mathbf{A}$ stretches a random vector out to be very long, then small changes in input 
will be amplified into large changes in output: tiny modifications of the input image would 
lead to vastly different predictions. This does not seem right! 


On the flip side, if $\mathbf{A}$ shrinks random vectors to be shorter, then after running through many 
layers, the vector will essentially shrink to nothing, and the output will not depend on the 
input. This is clearly not right either! 


We need to walk the narrow line between growth and decay to make sure that our output 
changes depending on our input, but not much! 


Let’s see what happens when we repeatedly multiply our matrix $\mathbf{A}$ against a random input 
vector, and keep track of the norm. 


# Calculate the sequence of norms after repeatedly applying `A` 
v_in = torch.randn(k, 1, dtype=torch.float64) 

norm_list = [torch.norm(v_in).item()] 
for i in range(1, 100): 
    v_in = A @ v_in 
    norm_list.append(torch.norm(v_in).item()) 

d2l.plot(torch.arange(0, 100), norm_list, 'Iteration', 'Value') 


[Figure: plot of Value against Iteration (0 to 100); the vertical axis is scaled by 1e38.] 


The norm is growing uncontrollably! Indeed if we take the list of quotients, we will see a 
pattern. 


# Compute the scaling factor of the norms 
norm_ratio_list = [] 
for i in range(1, 100): 
    norm_ratio_list.append(norm_list[i]/norm_list[i - 1]) 

d2l.plot(torch.arange(1, 100), norm_ratio_list, 'Iteration', 'Ratio') 


[Figure: plot of Ratio against Iteration (0 to 100); vertical axis ticks run from about 1.6 to 2.4.] 


If we look at the last portion of the above computation, we see that the random vector is 
stretched by a factor of 1.974459321485[...], where the portion at the end shifts a little, 
but the stretching factor is stable. 


Relating Back to Eigenvectors 


We have seen that eigenvectors and eigenvalues correspond to the amount something is 
stretched, but that was for specific vectors, and specific stretches. Let’s take a look at what 
they are for A. A bit of a caveat here: it turns out that to see them all, we will need to go 
to complex numbers. You can think of these as stretches and rotations. By taking the norm 
of the complex number (square root of the sums of squares of real and imaginary parts) we 
can measure that stretching factor. Let’s also sort them. 


# Compute the eigenvalues 
eigs = torch.linalg.eig(A).eigenvalues.tolist() 
norm_eigs = [torch.abs(torch.tensor(x)) for x in eigs] 
norm_eigs.sort() 
print(f'norms of eigenvalues: {norm_eigs}') 


norms of eigenvalues: [tensor(0.3490), tensor(1.1296), tensor(1.1296), tensor(1.1828), tensor(2.4532)] 


An Observation 


We see something a bit unexpected happening here: that number we identified before for the 
long term stretching of our matrix $\mathbf{A}$ applied to a random vector is exactly (accurate to 
thirteen decimal places!) the largest eigenvalue of $\mathbf{A}$. This is clearly not a coincidence! 


But, if we now think about what is happening geometrically, this starts to make sense. 
Consider a random vector. This random vector points a little in every direction, so in particular, 
it points at least a little bit in the same direction as the eigenvector of $\mathbf{A}$ associated with 
the largest eigenvalue. This is so important that it is called the principal eigenvalue and 
principal eigenvector. After applying $\mathbf{A}$, our random vector gets stretched in every 
possible direction, as is associated with every possible eigenvector, but it is stretched most of 
all in the direction associated with this principal eigenvector. What this means is that after 
applying $\mathbf{A}$, our random vector is longer, and points in a direction closer to being aligned 
with the principal eigenvector. After applying the matrix many times, the alignment with 
the principal eigenvector becomes closer and closer until, for all practical purposes, our 
random vector has been transformed into the principal eigenvector! Indeed this algorithm 
is the basis for what is known as the power iteration for finding the largest eigenvalue and 
eigenvector of a matrix. For details see, for example, (Golub and Van Loan, 1996). 
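

A bare-bones version of power iteration fits in a few lines. The sketch below is our own plain-Python illustration, not the routine from the cited reference. To keep the answer checkable by hand it uses a small $2 \times 2$ matrix with rows $[2, 1]$ and $[2, 3]$, whose eigenvalues are 1 and 4, rather than the random $5 \times 5$ matrix. At each step it applies the matrix, records the stretch factor, and renormalizes so the numbers do not overflow. The function name, starting vector, and step count are arbitrary choices.

```python
import math


def power_iteration(M, v, steps=50):
    """Repeatedly apply M and renormalize; return (stretch estimate, unit vector)."""
    estimate = 0.0
    for _ in range(steps):
        w = [sum(m * x for m, x in zip(row, v)) for row in M]  # w = M v
        norm_w = math.sqrt(sum(x * x for x in w))
        norm_v = math.sqrt(sum(x * x for x in v))
        estimate = norm_w / norm_v       # how much this step stretched the vector
        v = [x / norm_w for x in w]      # rescale to length one
    return estimate, v


A = [[2.0, 1.0], [2.0, 3.0]]
lam, vec = power_iteration(A, [1.0, 0.0])
print(lam, vec)
```

The stretch estimate settles at the largest eigenvalue in absolute value, and the vector lines up with the direction $[1, 2]^\top$. The stretch factor measures only magnitude, so a sign or a complex phase would need separate handling.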


Fixing the Normalization 


Now, from the discussion above, we concluded that we do not want a random vector to be 
stretched or squished at all; we would like random vectors to stay about the same size 
throughout the entire process. To do so, we now rescale our matrix by this principal 
eigenvalue so that the largest eigenvalue is instead now just one. Let’s see what happens in this 
case. 


# Rescale the matrix `A` 
A /= norm_eigs[-1] 

# Do the same experiment again 
v_in = torch.randn(k, 1, dtype=torch.float64) 

norm_list = [torch.norm(v_in).item()] 
for i in range(1, 100): 
    v_in = A @ v_in 
    norm_list.append(torch.norm(v_in).item()) 

d2l.plot(torch.arange(0, 100), norm_list, 'Iteration', 'Value') 


[Figure: plot of Value against Iteration (0 to 100).] 

We can also plot the ratio between consecutive norms as before and see that indeed it sta- 
bilizes. 


# Also plot the ratio 
norm_ratio_list = [] 
for i in range(1, 100): 
    norm_ratio_list.append(norm_list[i]/norm_list[i-1]) 

d2l.plot(torch.arange(1, 100), norm_ratio_list, 'Iteration', 'Ratio') 


[Figure: plot of Ratio against Iteration (0 to 100); vertical axis ticks run from about 0.85 to 1.05.] 


A.2.7 Discussion 


We now see exactly what we hoped for! After normalizing the matrices by the principal 
eigenvalue, we see that the random data does not explode as before, but rather eventually 
equilibrates to a specific value. It would be nice to be able to do these things from first 
principles, and it turns out that if we look deeply at the mathematics of it, we can see that 
the largest eigenvalue of a large random matrix with independent mean zero, variance one 
Gaussian entries is on average about $\sqrt{n}$, or in our case $\sqrt{5} \approx 2.2$, due to a fascinating fact 
known as the circular law (Ginibre, 1965). The relationship between the eigenvalues (and 
a related object called singular values) of random matrices has been shown to have deep 
connections to proper initialization of neural networks as was discussed in Pennington et 
al. (2017) and subsequent works. 


A.2.8 Summary 


• Eigenvectors are vectors which are stretched by a matrix without changing direction. 

• Eigenvalues are the amount that the eigenvectors are stretched by the application of the 
matrix. 

• The eigendecomposition of a matrix can allow for many operations to be reduced to 
operations on the eigenvalues. 

• The Gershgorin Circle Theorem can provide approximate values for the eigenvalues of 
a matrix. 

• The behavior of iterated matrix powers depends primarily on the size of the largest 
eigenvalue. This understanding has many applications in the theory of neural network 
initialization. 


A.2.9 Exercises 


1. What are the eigenvalues and eigenvectors of 


A= f A (A.21) 


2. What are the eigenvalues and eigenvectors of the following matrix, and what is strange 
about this example compared to the previous one? 


Sl 9 


z | l (4.22) 


3. Without computing the eigenvalues, is it possible that the smallest eigenvalue of the following matrix is less than 0.5? Note: this problem can be done in your head.

$$\mathbf{A} = \begin{bmatrix} 3.0 & 0.1 & 0.3 & 1.0 \\ 0.1 & 1.0 & 0.1 & 0.2 \\ 0.3 & 0.1 & 5.0 & 0.0 \\ 1.0 & 0.2 & 0.0 & 1.8 \end{bmatrix}. \tag{A.23}$$


A.3 Single Variable Calculus


In Section 2.4, we saw the basic elements of differential calculus. This section takes a 
deeper dive into the fundamentals of calculus and how we can understand and apply it in 
the context of machine learning. 


A.3.1 Differential Calculus 


Differential calculus is fundamentally the study of how functions behave under small changes. 
To see why this is so core to deep learning, let’s consider an example. 


Suppose that we have a deep neural network where the weights are, for convenience, concatenated into a single vector $\mathbf{w} = (w_1, \ldots, w_n)$. Given a training dataset, we consider the loss of our neural network on this dataset, which we will write as $\mathcal{L}(\mathbf{w})$.


This function is extraordinarily complex, encoding the performance of all possible models of the given architecture on this dataset, so it is nearly impossible to tell what set of weights $\mathbf{w}$ will minimize the loss. Thus, in practice, we often start by initializing our weights randomly, and then iteratively take small steps in the direction which makes the loss decrease as rapidly as possible.


The question then becomes something that on the surface is no easier: how do we find the direction which makes the loss decrease as quickly as possible? To dig into this, let’s first examine the case with only a single weight: $L(\mathbf{w}) = L(x)$ for a single real value $x$.


Let’s take $x$ and try to understand what happens when we change it by a small amount to $x + \epsilon$. If you wish to be concrete, think of a number like $\epsilon = 0.0000001$. To help us visualize what happens, let’s graph an example function, $f(x) = \sin(x^x)$, over the interval $[0, 3]$.


%matplotlib inline
import torch
from IPython import display
from d2l import torch as d2l

torch.pi = torch.acos(torch.zeros(1)).item() * 2  # Define pi in torch


# Plot a function in a normal range
x_big = torch.arange(0.01, 3.01, 0.01)
ys = torch.sin(x_big**x_big)
d2l.plot(x_big, ys, 'x', 'f(x)')

[Figure: plot of f(x) against x for x in [0, 3].]

At this large scale, the function’s behavior is not simple. However, if we reduce our range to something smaller like $[1.75, 2.25]$, we see that the graph becomes much simpler.

# Plot the same function in a smaller range
x_med = torch.arange(1.75, 2.25, 0.001)
ys = torch.sin(x_med**x_med)
d2l.plot(x_med, ys, 'x', 'f(x)')

[Figure: plot of f(x) against x for x in [1.75, 2.25].]


Taking this to an extreme, if we zoom into a tiny segment, the behavior becomes far simpler: 
it is just a straight line. 


# Plot the same function in a tiny range
x_small = torch.arange(2.0, 2.01, 0.0001)
ys = torch.sin(x_small**x_small)
d2l.plot(x_small, ys, 'x', 'f(x)')

[Figure: plot of f(x) against x for x in [2.000, 2.010].]


This is the key observation of single variable calculus: the behavior of familiar functions can be modeled by a line in a small enough range. This means that for most functions, it is reasonable to expect that as we shift the $x$ value of the function by a little bit, the output $f(x)$ will also be shifted by a little bit. The only question we need to answer is, “How large is the change in the output compared to the change in the input? Is it half as large? Twice as large?”


Thus, we can consider the ratio of the change in the output of a function for a small change 
in the input of the function. We can write this formally as 


$$\frac{L(x+\epsilon) - L(x)}{(x+\epsilon) - x} = \frac{L(x+\epsilon) - L(x)}{\epsilon}. \tag{A.1}$$


This is already enough to start to play around with in code. For instance, suppose that we know that $L(x) = x^{2} + 1701(x-4)^3$, then we can see how large this value is at the point $x = 4$ as follows.


# Define our function
def L(x):
    return x**2 + 1701*(x-4)**3

# Print the difference divided by epsilon for several epsilon
for epsilon in [0.1, 0.001, 0.0001, 0.00001]:
    print(f'epsilon = {epsilon:.5f} -> {(L(4+epsilon) - L(4)) / epsilon:.5f}')


epsilon = 0.10000 -> 25.11000 
epsilon = 0.00100 -> 8.00270 
epsilon = 0.00010 -> 8.00012 
epsilon = 0.00001 -> 8.00001 


Now, if we are observant, we will notice that the output is suspiciously close to 8. Indeed, if we decrease $\epsilon$, we will see the value become progressively closer to 8. Thus we may conclude, correctly, that the value we seek (the degree to which a change in the input changes the output) should be 8 at the point $x = 4$. The way that a mathematician encodes this fact is

$$\lim_{\epsilon \rightarrow 0}\frac{L(4+\epsilon) - L(4)}{\epsilon} = 8. \tag{A.2}$$


As a bit of a historical digression: in the first few decades of neural network research, scientists used this algorithm (the method of finite differences) to evaluate how a loss function changed under small perturbation: just change the weights and see how the loss changed. This is computationally inefficient, requiring two evaluations of the loss function to see how a single change of one variable influenced the loss. If we tried to do this with even a paltry few thousand parameters, it would require several thousand evaluations of the network over the entire dataset! It was not until 1986 that the backpropagation algorithm introduced in Rumelhart et al. (1988) provided a way to calculate how any change of the weights together would change the loss in the same computation time as a single prediction of the network over the dataset.


Back in our example, this value 8 is different for different values of $x$, so it makes sense to define it as a function of $x$. More formally, this value-dependent rate of change is referred to as the derivative, which is written as

$$\frac{df}{dx}(x) = \lim_{\epsilon \rightarrow 0}\frac{f(x+\epsilon) - f(x)}{\epsilon}. \tag{A.3}$$

Different texts will use different notations for the derivative. For instance, all of the below notations indicate the same thing:


$$\frac{df}{dx} = \frac{d}{dx}f = f' = \nabla_x f = D_x f = f_x. \tag{A.4}$$

Most authors will pick a single notation and stick with it, however even that is not guaranteed. It is best to be familiar with all of these. We will use the notation $\frac{df}{dx}$ throughout this text, unless we want to take the derivative of a complex expression, in which case we will use $\frac{d}{dx}f$ to write expressions like

$$\frac{d}{dx}\left[x^4+\cos\left(\frac{x^2+1}{2x-1}\right)\right]. \tag{A.5}$$


Oftentimes, it is intuitively useful to unravel the definition of derivative (A.3) again to see 
how a function changes when we make a small change of x: 


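$$\begin{aligned} \frac{df}{dx}(x) = \lim_{\epsilon \rightarrow 0}\frac{f(x+\epsilon) - f(x)}{\epsilon} & \implies \frac{df}{dx}(x) \approx \frac{f(x+\epsilon) - f(x)}{\epsilon} \\ & \implies \epsilon \frac{df}{dx}(x) \approx f(x+\epsilon) - f(x) \\ & \implies f(x+\epsilon) \approx f(x) + \epsilon \frac{df}{dx}(x). \end{aligned} \tag{A.6}$$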


The last equation is worth explicitly calling out. It tells us that if you take any function and 
change the input by a small amount, the output would change by that small amount scaled 
by the derivative. 


In this way, we can understand the derivative as the scaling factor that tells us how large a change we get in the output from a change in the input.
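The last line of (A.6) can be checked directly. The sketch below compares $f(x+\epsilon)$ with $f(x) + \epsilon \frac{df}{dx}(x)$ for the function $f(x) = \sin(x^x)$ plotted earlier. The closed form used for the derivative, $\cos(x^x)\, x^x (\log x + 1)$, is worked out by hand and is not part of the original text, and the point $x = 2$ and the step sizes are arbitrary choices for illustration.

```python
import math

def f(x):
    return math.sin(x**x)

def df(x):
    # Hand-derived: x^x = exp(x log x), so d/dx x^x = x^x (log x + 1)
    return math.cos(x**x) * x**x * (math.log(x) + 1)

x = 2.0
errors = []
for eps in [1e-1, 1e-2, 1e-3]:
    exact = f(x + eps)
    linear = f(x) + eps * df(x)  # the approximation from (A.6)
    errors.append(abs(exact - linear))
    print(f'eps = {eps:.0e}: exact = {exact:.6f}, '
          f'linear = {linear:.6f}, error = {errors[-1]:.2e}')
```

The error should shrink much faster than $\epsilon$ itself, which is what we would expect if the neglected part is of size roughly $\epsilon^2$.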


A.3.2 Rules of Calculus 


We now turn to the task of understanding how to compute the derivative of an explicit 
function. A full formal treatment of calculus would derive everything from first principles. 
We will not indulge in this temptation here, but rather provide an understanding of the 
common rules encountered. 


Common Derivatives 


As was seen in Section 2.4, when computing derivatives one can oftentimes use a series of 
rules to reduce the computation to a few core functions. We repeat them here for ease of 
reference. 


• Derivative of constants. $\frac{d}{dx}c = 0$.

• Derivative of linear functions. $\frac{d}{dx}(ax) = a$.

• Power rule. $\frac{d}{dx}x^n = nx^{n-1}$.

• Derivative of exponentials. $\frac{d}{dx}e^x = e^x$.

• Derivative of the logarithm. $\frac{d}{dx}\log(x) = \frac{1}{x}$.


Derivative Rules


If every derivative needed to be separately computed and stored in a table, differential calculus would be near impossible. It is a gift of mathematics that we can generalize the above derivatives and compute more complex derivatives like finding the derivative of $f(x) = \log\left(1+(x-1)^{10}\right)$. As was mentioned in Section 2.4, the key to doing so is to codify what happens when we take functions and combine them in various ways, most importantly: sums, products, and compositions.


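• Sum rule. $\frac{d}{dx}\left(g(x) + h(x)\right) = \frac{dg}{dx}(x) + \frac{dh}{dx}(x)$.

• Product rule. $\frac{d}{dx}\left(g(x)\cdot h(x)\right) = g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x)$.

• Chain rule. $\frac{d}{dx}g(h(x)) = \frac{dg}{dh}(h(x))\cdot \frac{dh}{dx}(x)$.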


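Let’s see how we may use (A.6) to understand these rules. For the sum rule, consider the following chain of reasoning:

$$\begin{aligned} f(x+\epsilon) & = g(x+\epsilon) + h(x+\epsilon) \\ & \approx g(x) + \epsilon \frac{dg}{dx}(x) + h(x) + \epsilon \frac{dh}{dx}(x) \\ & = g(x) + h(x) + \epsilon\left(\frac{dg}{dx}(x) + \frac{dh}{dx}(x)\right) \\ & = f(x) + \epsilon\left(\frac{dg}{dx}(x) + \frac{dh}{dx}(x)\right). \end{aligned} \tag{A.7}$$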


By comparing this result with the fact that $f(x+\epsilon) \approx f(x) + \epsilon \frac{df}{dx}(x)$, we see that $\frac{df}{dx}(x) = \frac{dg}{dx}(x) + \frac{dh}{dx}(x)$ as desired. The intuition here is: when we change the input $x$, $g$ and $h$ jointly contribute to the change of the output by $\frac{dg}{dx}(x)$ and $\frac{dh}{dx}(x)$.


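The product rule is more subtle, and will require a new observation about how to work with these expressions. We will begin as before using (A.6):

$$\begin{aligned} f(x+\epsilon) & = g(x+\epsilon)\cdot h(x+\epsilon) \\ & \approx \left(g(x) + \epsilon \frac{dg}{dx}(x)\right)\cdot\left(h(x) + \epsilon \frac{dh}{dx}(x)\right) \\ & = g(x)\cdot h(x) + \epsilon\left(g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x)\right) + \epsilon^2\frac{dg}{dx}(x)\frac{dh}{dx}(x) \\ & = f(x) + \epsilon\left(g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x)\right) + \epsilon^2\frac{dg}{dx}(x)\frac{dh}{dx}(x). \end{aligned} \tag{A.8}$$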


This resembles the computation done above, and indeed we see our answer ($\frac{df}{dx}(x) = g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x)$) sitting next to $\epsilon$, but there is the issue of that term of size $\epsilon^{2}$. We will refer to this as a higher-order term, since the power of $\epsilon^2$ is higher than the power of $\epsilon^1$. We will see in a later section that we will sometimes want to keep track of these, however for now observe that if $\epsilon = 0.0000001 = 10^{-7}$, then $\epsilon^{2} = 0.00000000000001 = 10^{-14}$, which is vastly smaller. As we send $\epsilon \rightarrow 0$, we may safely ignore the higher order terms. As a general convention in this appendix, we will use “$\approx$” to denote that the two terms are equal up to higher order terms. However, if we wish to be more formal we may examine the difference quotient

$$\frac{f(x+\epsilon) - f(x)}{\epsilon} = g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x) + \epsilon \frac{dg}{dx}(x)\frac{dh}{dx}(x), \tag{A.9}$$

and see that as we send $\epsilon \rightarrow 0$, the right hand term goes to zero as well.
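
Finally, with the chain rule, we can again progress as before using (A.6) and see that

$$\begin{aligned} f(x+\epsilon) & = g(h(x+\epsilon)) \\ & \approx g\left(h(x) + \epsilon \frac{dh}{dx}(x)\right) \\ & \approx g(h(x)) + \epsilon \frac{dh}{dx}(x) \frac{dg}{dh}(h(x)) \\ & = f(x) + \epsilon \frac{dg}{dh}(h(x))\frac{dh}{dx}(x), \end{aligned} \tag{A.10}$$

where in the second line we view the function $g$ as having its input ($h(x)$) shifted by the tiny quantity $\epsilon \frac{dh}{dx}(x)$.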
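The three rules can also be sanity-checked numerically. The following sketch, which is only an illustration, compares a central finite-difference estimate with the value predicted by each rule for the arbitrary choices $g(x) = \sin(x)$ and $h(x) = x^3$. The helper `numerical_derivative`, the test point, and the step size are assumptions made for this example.

```python
import math

def numerical_derivative(func, x, eps=1e-6):
    """Central difference estimate of the derivative of func at x."""
    return (func(x + eps) - func(x - eps)) / (2 * eps)

g, dg = math.sin, math.cos   # g(x) = sin(x), g'(x) = cos(x)
h = lambda x: x**3           # h(x) = x^3
dh = lambda x: 3 * x**2      # h'(x) = 3x^2

x = 0.7
# Each entry: (rule name, numerical estimate, value predicted by the rule)
checks = [
    ('sum', numerical_derivative(lambda t: g(t) + h(t), x),
     dg(x) + dh(x)),
    ('product', numerical_derivative(lambda t: g(t) * h(t), x),
     g(x) * dh(x) + dg(x) * h(x)),
    ('chain', numerical_derivative(lambda t: g(h(t)), x),
     dg(h(x)) * dh(x)),
]
for name, numeric, rule in checks:
    print(f'{name:8s} numeric = {numeric:.6f}, rule = {rule:.6f}')
```

The two columns should agree to several decimal places, with the small remaining gap coming from the finite step size and floating-point rounding.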


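These rules provide us with a flexible set of tools to compute essentially any expression desired. For instance,

$$\begin{aligned} \frac{d}{dx}\left[\log\left(1+(x-1)^{10}\right)\right] & = \left(1+(x-1)^{10}\right)^{-1}\frac{d}{dx}\left[1+(x-1)^{10}\right] \\ & = \left(1+(x-1)^{10}\right)^{-1}\left(\frac{d}{dx}[1] + \frac{d}{dx}\left[(x-1)^{10}\right]\right) \\ & = \left(1+(x-1)^{10}\right)^{-1}\left(0 + 10(x-1)^9\frac{d}{dx}[x-1]\right) \\ & = 10\left(1+(x-1)^{10}\right)^{-1}(x-1)^9 \\ & = \frac{10(x-1)^9}{1+(x-1)^{10}}. \end{aligned} \tag{A.11}$$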

Where each line has used the following rules:

1. The chain rule and derivative of logarithm.

2. The sum rule.

3. The derivative of constants, chain rule, and power rule.

4. The sum rule, derivative of linear functions, derivative of constants.

Two things should be clear after doing this example:

1. Any function we can write down using sums, products, constants, powers, exponentials, and logarithms can have its derivative computed mechanically by following these rules.

2. Having a human follow these rules can be tedious and error prone!


Thankfully, these two facts together hint towards a way forward: this is a perfect candidate for mechanization! Indeed backpropagation, which we will revisit later in this section, is exactly that.
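To make the idea of mechanization concrete, here is a minimal sketch that applies the rules above automatically by carrying a value and its derivative through every operation. Note that this is forward-mode differentiation with so-called dual numbers, a simpler relative of backpropagation rather than backpropagation itself. The class `Dual` and the helpers `lift` and `log` are hypothetical names introduced only for this example.

```python
import math

class Dual:
    """A value together with its derivative with respect to x."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    @staticmethod
    def lift(other):
        # Treat plain numbers as constants (derivative zero)
        return other if isinstance(other, Dual) else Dual(other, 0.0)

    def __add__(self, other):  # sum rule
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __sub__(self, other):  # sum rule with a sign flip
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __mul__(self, other):  # product rule
        other = Dual.lift(other)
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)
    __rmul__ = __mul__

    def __pow__(self, n):  # power rule with the chain rule (integer n)
        return Dual(self.value**n, n * self.value**(n - 1) * self.deriv)

def log(d):  # derivative of the logarithm with the chain rule
    return Dual(math.log(d.value), d.deriv / d.value)

def f(x):
    return log(1 + (x - 1)**10)

x0 = 1.5
result = f(Dual(x0, 1.0))  # seed with dx/dx = 1
closed_form = 10 * (x0 - 1)**9 / (1 + (x0 - 1)**10)  # from (A.11)
print(f'mechanical: {result.deriv:.8f}, closed form: {closed_form:.8f}')
```

The mechanically computed derivative should match the closed form from (A.11), without anyone having to carry out the algebra by hand.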


Linear Approximation 


When working with derivatives, it is often useful to geometrically interpret the approximation used above. In particular, note that the equation

$$f(x+\epsilon) \approx f(x) + \epsilon \frac{df}{dx}(x), \tag{A.12}$$

approximates the value of $f$ by a line which passes through the point $(x, f(x))$ and has slope $\frac{df}{dx}(x)$. In this way we say that the derivative gives a linear approximation to the function $f$, as illustrated below:


# Compute sin
xs = torch.arange(-torch.pi, torch.pi, 0.01)
plots = [torch.sin(xs)]

# Compute some linear approximations. Use d(sin(x)) / dx = cos(x)
for x0 in [-1.5, 0.0, 2.0]:
    plots.append(torch.sin(torch.tensor(x0)) + (xs - x0) *
                 torch.cos(torch.tensor(x0)))

d2l.plot(xs, plots, 'x', 'f(x)', ylim=[-1.5, 1.5])

[Figure: sin(x) and its linear approximations at several points, plotted as f(x) against x.]


Higher Order Derivatives 


Let’s now do something that may on the surface seem strange. Take a function $f$ and compute the derivative $\frac{df}{dx}$. This gives us the rate of change of $f$ at any point.

However, the derivative, $\frac{df}{dx}$, can be viewed as a function itself, so nothing stops us from computing the derivative of $\frac{df}{dx}$ to get $\frac{d^2f}{dx^2} = \frac{d}{dx}\left(\frac{df}{dx}\right)$. We will call this the second derivative of $f$. This function is the rate of change of the rate of change of $f$, or in other words, how the rate of change is changing. We may apply the derivative any number of times to obtain what is called the $n$-th derivative. To keep the notation clean, we will denote the $n$-th derivative as

$$f^{(n)}(x) = \frac{d^{n}f}{dx^{n}} = \left(\frac{d}{dx}\right)^{n} f. \tag{A.13}$$


Let’s try to understand why this is a useful notion. Below, we visualize $f^{(2)}(x)$, $f^{(1)}(x)$, and $f(x)$.


First, consider the case that the second derivative $f^{(2)}(x)$ is a positive constant. This means that the slope of the first derivative is positive. As a result, the first derivative $f^{(1)}(x)$ may start out negative, become zero at a point, and then become positive in the end. This tells us the slope of our original function $f$ and therefore, the function $f$ itself decreases, flattens out, then increases. In other words, the function $f$ curves up, and has a single minimum as is shown in Fig. A.1.


[Figure A.1, panels $f^{(2)}(x)$, $f^{(1)}(x)$, $f(x)$: If we assume the second derivative is a positive constant, then the first derivative is increasing, which implies the function itself has a minimum.]


Second, if the second derivative is a negative constant, that means that the first derivative is decreasing. This implies the first derivative may start out positive, become zero at a point, and then become negative. Hence, the function $f$ itself increases, flattens out, then decreases. In other words, the function $f$ curves down, and has a single maximum as is shown in Fig. A.2.


[Figure A.2, panels $f^{(2)}(x)$, $f^{(1)}(x)$, $f(x)$: If we assume the second derivative is a negative constant, then the first derivative is decreasing, which implies the function itself has a maximum.]


Third, if the second derivative is always zero, then the first derivative will never change: it is constant! This means that $f$ increases (or decreases) at a fixed rate, and $f$ is itself a straight line as is shown in Fig. A.3.


To summarize, the second derivative can be interpreted as describing the way that the function $f$ curves. A positive second derivative leads to an upwards curve, while a negative second derivative means that $f$ curves downwards, and a zero second derivative means that $f$ does not curve at all.

[Figure A.3, panels $f^{(2)}(x)$, $f^{(1)}(x)$, $f(x)$: If we assume the second derivative is zero, then the first derivative is constant, which implies the function itself is a straight line.]
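The three cases can be illustrated numerically. The sketch below estimates the second derivative with a central finite difference for one function of each kind. The helper `second_derivative`, the example functions, and the evaluation point are assumptions chosen only for illustration.

```python
def second_derivative(func, x, eps=1e-4):
    """Central difference estimate of the second derivative of func at x."""
    return (func(x + eps) - 2 * func(x) + func(x - eps)) / eps**2

examples = {
    'curves up (x^2)': lambda x: x**2,
    'curves down (-x^2)': lambda x: -x**2,
    'straight line (3x + 1)': lambda x: 3 * x + 1,
}
estimates = {name: second_derivative(func, 0.5)
             for name, func in examples.items()}
for name, value in estimates.items():
    print(f"{name:24s} second derivative is approximately {value:+.4f}")
```

The estimate should be positive for the function that curves up, negative for the one that curves down, and essentially zero for the straight line.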


Let’s take this one step further. Consider the function $g(x) = ax^{2}+ bx + c$. We can then compute that

$$\begin{aligned} \frac{dg}{dx}(x) & = 2ax + b \\ \frac{d^2g}{dx^2}(x) & = 2a. \end{aligned} \tag{A.14}$$

If we have some original function $f(x)$ in mind, we may compute the first two derivatives and find the values for $a$, $b$, and $c$ that make them match this computation. Similarly to the previous section where we saw that the first derivative gave the best approximation with a straight line, this construction provides the best approximation by a quadratic. Let’s visualize this for $f(x) = \sin(x)$.


# Compute sin
xs = torch.arange(-torch.pi, torch.pi, 0.01)
plots = [torch.sin(xs)]

# Compute some quadratic approximations. Use d(sin(x)) / dx = cos(x)
for x0 in [-1.5, 0.0, 2.0]:
    plots.append(torch.sin(torch.tensor(x0)) + (xs - x0) *
                 torch.cos(torch.tensor(x0)) - (xs - x0)**2 *
                 torch.sin(torch.tensor(x0)) / 2)

d2l.plot(xs, plots, 'x', 'f(x)', ylim=[-1.5, 1.5])


[Figure: sin(x) and its quadratic approximations at several points, plotted as f(x) against x.]

We will extend this idea to the idea of a Taylor series in the next section.


Taylor Series


The Taylor series provides a method to approximate the function $f(x)$ if we are given values for the first $n$ derivatives at a point $x_0$, i.e., $\left\{ f(x_0), f^{(1)}(x_0), f^{(2)}(x_0), \ldots, f^{(n)}(x_0) \right\}$. The idea will be to find a degree $n$ polynomial that matches all the given derivatives at $x_0$.


We saw the case of $n=2$ in the previous section and a little algebra shows this is

$$f(x) \approx \frac{1}{2}\frac{d^2f}{dx^2}(x_0)(x-x_0)^{2}+ \frac{df}{dx}(x_0)(x-x_0) + f(x_0). \tag{A.15}$$

As we can see above, the denominator of $2$ is there to cancel out the $2$ we get when we take two derivatives of $x^2$, while the other terms are all zero. The same logic applies for the first derivative and the value itself.

If we push the logic further to $n=3$, we will conclude that

$$f(x) \approx \frac{\frac{d^3f}{dx^3}(x_0)}{6}(x-x_0)^3 + \frac{\frac{d^2f}{dx^2}(x_0)}{2}(x-x_0)^{2}+ \frac{df}{dx}(x_0)(x-x_0) + f(x_0), \tag{A.16}$$

where the $6 = 3 \times 2 = 3!$ comes from the constant we get in front if we take three derivatives of $x^3$.
X 


Furthermore, we can get a degree $n$ polynomial by

$$P_n(x) = \sum_{i = 0}^{n} \frac{f^{(i)}(x_0)}{i!}(x-x_0)^{i}, \tag{A.17}$$

where the notation

$$f^{(n)}(x) = \frac{d^{n}f}{dx^{n}} = \left(\frac{d}{dx}\right)^{n} f. \tag{A.18}$$

Indeed, $P_n(x)$ can be viewed as the best $n$-th degree polynomial approximation to our function $f(x)$.


While we are not going to dive all the way into the error of the above approximations, it is worth mentioning the infinite limit. In this case, for well behaved functions (known as real analytic functions) like $\cos(x)$ or $e^{x}$, we can write out the infinite number of terms and recover exactly the same function

$$f(x) = \sum_{n = 0}^\infty \frac{f^{(n)}(x_0)}{n!}(x-x_0)^{n}. \tag{A.19}$$


Take $f(x) = e^{x}$ as an example. Since $e^{x}$ is its own derivative, we know that $f^{(n)}(x) = e^{x}$. Therefore, $e^{x}$ can be reconstructed by taking the Taylor series at $x_0 = 0$, i.e.,

$$e^{x} = \sum_{n = 0}^\infty \frac{x^{n}}{n!} = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \cdots. \tag{A.20}$$


Let’s see how this works in code and observe how increasing the degree of the Taylor approximation brings us closer to the desired function $e^x$.

# Compute the exponential function
xs = torch.arange(0, 3, 0.01)
ys = torch.exp(xs)

# Compute a few Taylor series approximations
P1 = 1 + xs
P2 = 1 + xs + xs**2 / 2
P5 = 1 + xs + xs**2 / 2 + xs**3 / 6 + xs**4 / 24 + xs**5 / 120

d2l.plot(xs, [ys, P1, P2, P5], 'x', 'f(x)', legend=[
    "Exponential", "Degree 1 Taylor Series", "Degree 2 Taylor Series",
    "Degree 5 Taylor Series"])


[Figure: the exponential function and its degree 1, 2, and 5 Taylor series approximations, plotted as f(x) against x.]


Taylor series have two primary applications: 


1. Theoretical applications: Often when we try to understand a function that is too complex, using Taylor series enables us to turn it into a polynomial that we can work with directly.

2. Numerical applications: Some functions like $e^{x}$ or $\cos(x)$ are difficult for machines to compute. They can store tables of values at a fixed precision (and this is often done), but it still leaves open questions like “What is the 1000-th digit of $\cos(1)$?” Taylor series are often helpful to answer such questions.
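As a small illustration of the numerical use case, the sketch below sums the Taylor series of $\cos$ at $x_0 = 0$ in exact rational arithmetic and prints the first 30 decimal digits of $\cos(1)$. The function name `cos_taylor` and the choice of 20 terms are assumptions for this example. Since the terms alternate in sign and shrink rapidly, the truncation error is bounded by the first omitted term, $1/40!$, which is far below $10^{-30}$.

```python
from fractions import Fraction
import math

def cos_taylor(x, num_terms):
    """Partial sum of the series cos(x) = sum_k (-1)^k x^(2k) / (2k)!."""
    x = Fraction(x)
    total, term = Fraction(0), Fraction(1)  # first term is x^0 / 0! = 1
    for k in range(num_terms):
        total += term
        # Next term: multiply by -x^2 / ((2k + 1)(2k + 2))
        term *= -x * x / ((2 * k + 1) * (2 * k + 2))
    return total

approx = cos_taylor(1, 20)  # exact rational value of the partial sum
# Extract 30 decimal digits by integer division of the exact fraction
digits = approx.numerator * 10**30 // approx.denominator
print(f'cos(1) is approximately 0.{digits:030d}')
print(f'math.cos(1) gives       {math.cos(1):.15f}')
```

Asking for more digits only requires more terms and a larger power of ten, which is the sense in which Taylor series help with questions about the 1000-th digit.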


A.3.3 Summary 


• Derivatives can be used to express how functions change when we change the input by a small amount.

• Elementary derivatives can be combined using derivative rules to create arbitrarily complex derivatives.

• Derivatives can be iterated to get second or higher order derivatives. Each increase in order provides more fine grained information on the behavior of the function.

• Using information in the derivatives at a single point, we can approximate well behaved functions by polynomials obtained from the Taylor series.


A.3.4 Exercises 


1. What is the derivative of $x^3-4x+1$?

2. What is the derivative of $\log(\frac{1}{x})$?

3. True or False: If $f'(x) = 0$ then $f$ has a maximum or minimum at $x$?

4. Where is the minimum of $f(x) = x\log(x)$ for $x\ge0$ (where we assume that $f$ takes the limiting value of $0$ at $f(0)$)?


A.4 Multivariable Calculus


Now that we have a fairly strong understanding of derivatives of a function of a single 
variable, let’s return to our original question where we were considering a loss function of 
potentially billions of weights. 


A.4.1 Higher-Dimensional Differentiation 


What Section A.3 tells us is that if we change a single one of these billions of weights 
leaving every other one fixed, we know what will happen! This is nothing more than a 
function of a single variable, so we can write 


$$L(w_1+\epsilon_1, w_2, \ldots, w_N) \approx L(w_1, w_2, \ldots, w_N) + \epsilon_1 \frac{d}{dw_1} L(w_1, w_2, \ldots, w_N). \tag{A.1}$$

We will call the derivative in one variable while fixing the other variables the partial derivative, and we will use the notation $\frac{\partial}{\partial w_1}$ for the derivative in (A.1).


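Now, let’s take this and change $w_2$ a little bit to $w_2 + \epsilon_2$:

$$\begin{aligned} L(w_1+\epsilon_1, w_2+\epsilon_2, \ldots, w_N) & \approx L(w_1, w_2+\epsilon_2, \ldots, w_N) + \epsilon_1 \frac{\partial}{\partial w_1} L(w_1, w_2+\epsilon_2, \ldots, w_N) \\ & \approx L(w_1, w_2, \ldots, w_N) \\ & \quad + \epsilon_2\frac{\partial}{\partial w_2} L(w_1, w_2, \ldots, w_N) \\ & \quad + \epsilon_1 \frac{\partial}{\partial w_1} L(w_1, w_2, \ldots, w_N) \\ & \quad + \epsilon_1\epsilon_2\frac{\partial}{\partial w_2}\frac{\partial}{\partial w_1} L(w_1, w_2, \ldots, w_N) \\ & \approx L(w_1, w_2, \ldots, w_N) \\ & \quad + \epsilon_2\frac{\partial}{\partial w_2} L(w_1, w_2, \ldots, w_N) \\ & \quad + \epsilon_1 \frac{\partial}{\partial w_1} L(w_1, w_2, \ldots, w_N). \end{aligned} \tag{A.2}$$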


We have again used the idea that $\epsilon_1\epsilon_2$ is a higher order term that we can discard in the same way we could discard $\epsilon^{2}$ in the previous section, along with what we saw in (A.1). By continuing in this manner, we may write that

$$L(w_1+\epsilon_1, w_2+\epsilon_2, \ldots, w_N+\epsilon_N) \approx L(w_1, w_2, \ldots, w_N) + \sum_i \epsilon_i \frac{\partial}{\partial w_i} L(w_1, w_2, \ldots, w_N). \tag{A.3}$$


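This may look like a mess, but we can make this more familiar by noting that the sum on the right looks exactly like a dot product, so if we let

$$\boldsymbol{\epsilon} = [\epsilon_1, \ldots, \epsilon_N]^\top \; \textrm{and} \; \nabla_{\mathbf{w}} L = \left[\frac{\partial L}{\partial w_1}, \ldots, \frac{\partial L}{\partial w_N}\right]^\top, \tag{A.4}$$

then

$$L(\mathbf{w} + \boldsymbol{\epsilon}) \approx L(\mathbf{w}) + \boldsymbol{\epsilon}\cdot \nabla_{\mathbf{w}} L(\mathbf{w}). \tag{A.5}$$

We will call the vector $\nabla_{\mathbf{w}} L$ the gradient of $L$.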


Equation (A.5) is worth pondering for a moment. It has exactly the format that we encountered in one dimension, except that we have converted everything to vectors and dot products. It allows us to tell approximately how the function $L$ will change given any perturbation to the input. As we will see in the next section, this will provide us with an important tool in understanding geometrically how we can learn using information contained in the gradient.


But first, let’s see this approximation at work with an example. Suppose that we are working 
with the function 


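$$f(x, y) = \log(e^x + e^y) \text{ with gradient } \nabla f (x, y) = \left[\frac{e^x}{e^x+e^y}, \frac{e^y}{e^x+e^y}\right]. \tag{A.6}$$

If we look at a point like $(0, \log(2))$, we see that

$$f(x, y) = \log(3) \text{ with gradient } \nabla f (x, y) = \left[\frac{1}{3}, \frac{2}{3}\right]. \tag{A.7}$$

Thus, if we want to approximate $f$ at $(\epsilon_1, \log(2) + \epsilon_2)$, we see that we should have the specific instance of (A.5):

$$f(\epsilon_1, \log(2) + \epsilon_2) \approx \log(3) + \frac{1}{3}\epsilon_1 + \frac{2}{3}\epsilon_2. \tag{A.8}$$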


We can test this in code to see how good the approximation is. 


%matplotlib inline
import numpy as np
import torch
from IPython import display
from mpl_toolkits import mplot3d
from d2l import torch as d2l

def f(x, y):
    return torch.log(torch.exp(x) + torch.exp(y))
def grad_f(x, y):
    return torch.tensor([torch.exp(x) / (torch.exp(x) + torch.exp(y)),
                         torch.exp(y) / (torch.exp(x) + torch.exp(y))])

epsilon = torch.tensor([0.01, -0.03])
grad_approx = f(torch.tensor([0.]), torch.log(
    torch.tensor([2.]))) + epsilon.dot(
    grad_f(torch.tensor([0.]), torch.log(torch.tensor(2.))))
true_value = f(torch.tensor([0.]) + epsilon[0], torch.log(
    torch.tensor([2.])) + epsilon[1])
f'approximation: {grad_approx}, true Value: {true_value}'


'approximation: tensor([1.0819]), true Value: tensor([1.0821])'


A.4.2 Geometry of Gradients and Gradient Descent 


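Consider the expression from (A.5) again:

$$L(\mathbf{w} + \boldsymbol{\epsilon}) \approx L(\mathbf{w}) + \boldsymbol{\epsilon}\cdot \nabla_{\mathbf{w}} L(\mathbf{w}). \tag{A.9}$$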


Let’s suppose that we want to use this to help minimize our loss $L$. Let’s understand geometrically the algorithm of gradient descent first described in Section 2.5. What we will do is the following:

1. Start with a random choice for the initial parameters $\mathbf{w}$.

2. Find the direction $\mathbf{v}$ that makes $L$ decrease the most rapidly at $\mathbf{w}$.

3. Take a small step in that direction: $\mathbf{w} \rightarrow \mathbf{w} + \epsilon\mathbf{v}$.

4. Repeat.


The only thing we do not know exactly how to do is to compute the vector $\mathbf{v}$ in the second step. We will call such a direction the direction of steepest descent. Using the geometric understanding of dot products from Section A.1, we see that we can rewrite (A.5) as

$$L(\mathbf{w} + \mathbf{v}) \approx L(\mathbf{w}) + \mathbf{v}\cdot \nabla_{\mathbf{w}} L(\mathbf{w}) = L(\mathbf{w}) + \|\nabla_{\mathbf{w}} L(\mathbf{w})\|\cos(\theta). \tag{A.10}$$


Note that we have taken our direction to have length one for convenience, and used $\theta$ for the angle between $\mathbf{v}$ and $\nabla_{\mathbf{w}} L(\mathbf{w})$. If we want to find the direction that decreases $L$ as rapidly as possible, we want to make this expression as negative as possible. The only way the direction we pick enters into this equation is through $\cos(\theta)$, and thus we wish to make this cosine as negative as possible. Now, recalling the shape of cosine, we can make this as negative as possible by making $\cos(\theta) = -1$ or equivalently making the angle between the gradient and our chosen direction to be $\pi$ radians, or equivalently $180$ degrees. The only way to achieve this is to head in the exact opposite direction: pick $\mathbf{v}$ to point in the exact opposite direction to $\nabla_{\mathbf{w}} L(\mathbf{w})$!
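This argument can be checked by brute force in two dimensions. The sketch below takes the example gradient $[\frac{1}{3}, \frac{2}{3}]$ computed earlier in this section, sweeps over unit directions $\mathbf{v}$ on a fine grid of angles, and reports the direction with the most negative first-order change $\mathbf{v}\cdot\nabla L$. The grid resolution and variable names are arbitrary choices made for illustration.

```python
import math

grad = (1 / 3, 2 / 3)  # gradient of log(e^x + e^y) at (0, log 2)
grad_norm = math.hypot(*grad)

best_angle, best_change = None, float('inf')
for step in range(3600):  # unit directions spaced 0.1 degrees apart
    angle = 2 * math.pi * step / 3600
    v = (math.cos(angle), math.sin(angle))
    change = v[0] * grad[0] + v[1] * grad[1]  # first-order change v . grad
    if change < best_change:
        best_angle, best_change = angle, change

best_v = (math.cos(best_angle), math.sin(best_angle))
steepest = (-grad[0] / grad_norm, -grad[1] / grad_norm)
print(f'best direction found: ({best_v[0]:.3f}, {best_v[1]:.3f})')
print(f'-grad / ||grad||:     ({steepest[0]:.3f}, {steepest[1]:.3f})')
print(f'best change: {best_change:.4f}, -||grad||: {-grad_norm:.4f}')
```

Up to the resolution of the grid, the best direction found should coincide with the normalized negative gradient, and the best achievable change should equal $-\|\nabla L\|$.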


This brings us to one of the most important mathematical concepts in machine learning: the direction of steepest descent points in the direction of $-\nabla_{\mathbf{w}}L(\mathbf{w})$. Thus our informal algorithm can be rewritten as follows.

1. Start with a random choice for the initial parameters $\mathbf{w}$.

2. Compute $\nabla_{\mathbf{w}} L(\mathbf{w})$.

3. Take a small step in the opposite of that direction: $\mathbf{w} \rightarrow \mathbf{w} - \epsilon\nabla_{\mathbf{w}} L(\mathbf{w})$.

4. Repeat.


This basic algorithm has been modified and adapted many ways by many researchers, but 
the core concept remains the same in all of them. Use the gradient to find the direction that 
decreases the loss as rapidly as possible, and update the parameters to take a step in that 
direction. 
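The four steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than a training routine: the bowl-shaped loss, its hand-written gradient, the fixed starting point (used instead of a random one so that the run is repeatable), the step size, and the number of iterations are all chosen only for this example.

```python
def loss(w):
    # A simple bowl with its minimum at (1, -3); chosen only for illustration
    return (w[0] - 1)**2 + 2 * (w[1] + 3)**2

def grad_loss(w):
    # Gradient of the loss above, worked out by hand
    return [2 * (w[0] - 1), 4 * (w[1] + 3)]

w = [4.0, 2.0]       # step 1: initial parameters
learning_rate = 0.1  # the small step size epsilon
for step in range(200):
    g = grad_loss(w)  # step 2: compute the gradient
    # step 3: take a small step in the opposite direction
    w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    # step 4: repeat

print(f'w = [{w[0]:.4f}, {w[1]:.4f}], loss = {loss(w):.2e}')
```

With these settings the parameters should settle very close to the minimizer $(1, -3)$. A step size that is too large would instead make the iterates oscillate or diverge.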


A.4.3 A Note on Mathematical Optimization 


Throughout this book, we focus squarely on numerical optimization techniques for the practical reason that all functions we encounter in the deep learning setting are too complex to minimize explicitly.


However, it is a useful exercise to consider what the geometric understanding we obtained 
above tells us about optimizing functions directly. 


Suppose that we wish to find the value of $\mathbf{x}_0$ which minimizes some function $L(\mathbf{x})$. Let’s suppose that moreover someone gives us a value and tells us that it is the value that minimizes $L$. Is there anything we can check to see if their answer is even plausible?


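Again consider (A.5):

$$L(\mathbf{x}_0 + \boldsymbol{\epsilon}) \approx L(\mathbf{x}_0) + \boldsymbol{\epsilon}\cdot \nabla_{\mathbf{x}} L(\mathbf{x}_0). \tag{A.11}$$

If the gradient is not zero, we know that we can take a step in the direction $-\epsilon \nabla_{\mathbf{x}} L(\mathbf{x}_0)$ to find a value of $L$ that is smaller. Thus, if we truly are at a minimum, this cannot be the case! We can conclude that if $\mathbf{x}_0$ is a minimum, then $\nabla_{\mathbf{x}} L(\mathbf{x}_0) = 0$. We call points with $\nabla_{\mathbf{x}} L(\mathbf{x}_0) = 0$ critical points.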


This is nice, because in some rare settings, we can explicitly find all the points where the 
gradient is zero, and find the one with the smallest value. 


For a concrete example, consider the function

$$f(x) = 3x^4 - 4x^3 - 12x^2. \tag{A.12}$$

This function has derivative

$$\frac{df}{dx} = 12x^3 - 12x^2 - 24x = 12x(x-2)(x+1). \tag{A.13}$$

The only possible locations of minima are at $x = -1, 0, 2$, where the function takes the values $-5, 0, -32$ respectively, and thus we can conclude that we minimize our function when $x = 2$. A quick plot confirms this.

x = torch.arange(-2, 3, 0.01)
f = (3 * x**4) - (4 * x**3) - (12 * x**2)

d2l.plot(x, f, 'x', 'f(x)')


This highlights an important fact to know when working either theoretically or numerically: 
the only possible points where we can minimize (or maximize) a function will have gradient 
equal to zero, however, not every point with gradient zero is the true global minimum (or 
maximum). 
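For the polynomial in (A.12) this check is easy to carry out explicitly. The short sketch below evaluates the derivative and the function at the three critical points and picks the one with the smallest value. It uses plain Python integers, and the variable names are chosen only for this example.

```python
def f(x):
    return 3 * x**4 - 4 * x**3 - 12 * x**2

def df(x):
    return 12 * x**3 - 12 * x**2 - 24 * x

critical_points = [-1, 0, 2]  # roots of 12x(x - 2)(x + 1)
for x in critical_points:
    print(f'x = {x:2d}: df/dx = {df(x)}, f(x) = {f(x)}')

best = min(critical_points, key=f)
print(f'smallest critical value is at x = {best}')
```

All three points have zero derivative, yet only $x = 2$ is the global minimum: $x = -1$ is merely a local minimum and $x = 0$ is a local maximum.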


A.4.4 Multivariate Chain Rule 


Let’s suppose that we have a function of four variables ($w$, $x$, $y$, and $z$) which we can make by composing many terms:

$$\begin{aligned} f(u, v) & = (u+v)^{2} \\ u(a, b) & = (a+b)^{2}, \qquad v(a, b) = (a-b)^{2}, \\ a(w, x, y, z) & = (w+x+y+z)^{2}, \qquad b(w, x, y, z) = (w+x-y-z)^2. \end{aligned} \tag{A.14}$$

Such chains of equations are common when working with neural networks, so trying to understand how to compute gradients of such functions is key. We can start to see visual hints of this connection in Fig. A.1 if we take a look at what variables directly relate to one another.

[Figure A.1: The function relations above where nodes represent values and edges show functional dependence.]


Nothing stops us from just composing everything from (A.14) and writing out that

$$f(w, x, y, z) = \left(\left((w+x+y+z)^2+(w+x-y-z)^2\right)^2+\left((w+x+y+z)^2-(w+x-y-z)^2\right)^2\right)^2. \tag{A.15}$$

We may then take the derivative by just using single variable derivatives, but if we did that we would quickly find ourselves swamped with terms, many of which are repeats! Indeed, one can see that, for instance:

$$\begin{aligned} \frac{\partial f}{\partial w} & = 2 \left(2 \left(2 (w + x + y + z) - 2 (w + x - y - z)\right) \left((w + x + y + z)^{2}- (w + x - y - z)^{2}\right) + \right.\\ & \left. \quad 2 \left(2 (w + x - y - z) + 2 (w + x + y + z)\right) \left((w + x - y - z)^{2}+ (w + x + y + z)^{2}\right)\right) \times \\ & \quad \left(\left((w + x + y + z)^{2}- (w + x - y - z)^2\right)^{2}+ \left((w + x - y - z)^{2}+ (w + x + y + z)^{2}\right)^{2}\right). \end{aligned} \tag{A.16}$$

If we then also wanted to compute $\frac{\partial f}{\partial x}$, we would end up with a similar equation again with many repeated terms, and many shared repeated terms between the two derivatives. This represents a massive quantity of wasted work, and if we needed to compute derivatives this way, the whole deep learning revolution would have stalled out before it began!


Let’s break up the problem. We will start by trying to understand how f changes when we 
change a, essentially assuming that w, x, y, and z all do not exist. We will reason as we 
did back when we worked with the gradient for the first time. Let’s take a and add a small 
amount € to it. 


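$$\begin{aligned} & f(u(a+\epsilon, b), v(a+\epsilon, b)) \\ \approx & f\left(u(a, b) + \epsilon\frac{\partial u}{\partial a}(a, b), v(a, b) + \epsilon\frac{\partial v}{\partial a}(a, b)\right) \\ \approx & f(u(a, b), v(a, b)) + \epsilon\left[\frac{\partial f}{\partial u}(u(a, b), v(a, b))\frac{\partial u}{\partial a}(a, b) + \frac{\partial f}{\partial v}(u(a, b), v(a, b))\frac{\partial v}{\partial a}(a, b)\right]. \end{aligned} \tag{A.17}$$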


The first line follows from the definition of partial derivative, and the second follows from the definition of gradient. It is notationally burdensome to track exactly where we evaluate every derivative, as in the expression $\frac{\partial f}{\partial u}(u(a, b), v(a, b))$, so we often abbreviate this to the much more memorable

$$\frac{\partial f}{\partial a} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial a}+\frac{\partial f}{\partial v}\frac{\partial v}{\partial a}. \tag{A.18}$$

It is useful to think about the meaning of the process. We are trying to understand how a function of the form $f(u(a, b), v(a, b))$ changes its value with a change in $a$. There are two pathways along which this can occur: the pathway where $a \rightarrow u \rightarrow f$ and the pathway where $a \rightarrow v \rightarrow f$. We can compute both of these contributions via the chain rule, $\frac{\partial f}{\partial u} \cdot \frac{\partial u}{\partial a}$ and $\frac{\partial f}{\partial v} \cdot \frac{\partial v}{\partial a}$ respectively, and add them up.
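The two-path sum in (A.18) can be verified numerically for the functions $u$, $v$, and $f$ defined in (A.14). The sketch below compares it with a central finite difference in $a$. The test point and the step size are arbitrary choices made for illustration.

```python
def u(a, b):
    return (a + b)**2

def v(a, b):
    return (a - b)**2

def f(a, b):
    return (u(a, b) + v(a, b))**2

a, b = 0.3, -0.2
# Sum over the two paths a -> u -> f and a -> v -> f, as in (A.18)
df_du = df_dv = 2 * (u(a, b) + v(a, b))
du_da, dv_da = 2 * (a + b), 2 * (a - b)
chain_rule = df_du * du_da + df_dv * dv_da

# Independent estimate from a central finite difference in a
eps = 1e-6
finite_difference = (f(a + eps, b) - f(a - eps, b)) / (2 * eps)
print(f'chain rule: {chain_rule:.8f}, '
      f'finite difference: {finite_difference:.8f}')
```

The two numbers should agree to many decimal places. Dropping either path from the sum would break the agreement, which is a quick way to see that both contributions are needed.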


Imagine we have a different network of functions where the functions on the right depend on those that are connected to on the left as is shown in Fig. A.2.

To compute something like $\frac{\partial f}{\partial y}$, we need to sum over all (in this case $3$) paths from $y$ to $f$ giving

$$\frac{\partial f}{\partial y} = \frac{\partial f}{\partial a} \frac{\partial a}{\partial u} \frac{\partial u}{\partial y} + \frac{\partial f}{\partial u} \frac{\partial u}{\partial y} + \frac{\partial f}{\partial b} \frac{\partial b}{\partial v} \frac{\partial v}{\partial y}. \tag{A.19}$$

Understanding the chain rule in this way will pay great dividends when trying to understand how gradients flow through networks, and why various architectural choices like those in LSTMs (Section 10.1) or residual layers (Section 8.6) can help shape the learning process by controlling gradient flow.


A.4.5 The Backpropagation Algorithm 


Let’s return to the example of (A.14) in the previous section, where 

$$
\begin{aligned}
f(u, v) & = (u+v)^{2}, \\
u(a, b) & = (a+b)^{2}, \qquad v(a, b) = (a-b)^{2}, \\
a(w, x, y, z) & = (w+x+y+z)^{2}, \qquad b(w, x, y, z) = (w+x-y-z)^{2}.
\end{aligned}
\tag{A.20}
$$


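If we want to compute say $\frac{\partial f}{\partial w}$ we may apply the multi-variate chain rule to see: 

$$
\begin{aligned}
\frac{\partial f}{\partial w} & = \frac{\partial f}{\partial u}\frac{\partial u}{\partial w} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial w}, \\
\frac{\partial u}{\partial w} & = \frac{\partial u}{\partial a}\frac{\partial a}{\partial w} + \frac{\partial u}{\partial b}\frac{\partial b}{\partial w}, \\
\frac{\partial v}{\partial w} & = \frac{\partial v}{\partial a}\frac{\partial a}{\partial w} + \frac{\partial v}{\partial b}\frac{\partial b}{\partial w}.
\end{aligned}
\tag{A.21}
$$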


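Let’s try using this decomposition to compute $\frac{\partial f}{\partial w}$. Notice that all we need here are the 
various single step partials: 

$$
\begin{aligned}
\frac{\partial f}{\partial u} & = 2(u+v), & \frac{\partial f}{\partial v} & = 2(u+v), \\
\frac{\partial u}{\partial a} & = 2(a+b), & \frac{\partial u}{\partial b} & = 2(a+b), \\
\frac{\partial v}{\partial a} & = 2(a-b), & \frac{\partial v}{\partial b} & = -2(a-b), \\
\frac{\partial a}{\partial w} & = 2(w+x+y+z), & \frac{\partial b}{\partial w} & = 2(w+x-y-z).
\end{aligned}
\tag{A.22}
$$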


If we write this out into code this becomes a fairly manageable expression. 


# Compute the value of the function from inputs to outputs 
w, x, y, z = -1, 0, -2, 1 
a, b = (w + x + y + z)**2, (w + x - y - z)**2 
u, v = (a + b)**2, (a - b)**2 
f = (u + v)**2 
print(f'f at {w}, {x}, {y}, {z} is {f}') 

# Compute the single step partials 
df_du, df_dv = 2*(u + v), 2*(u + v) 
du_da, du_db, dv_da, dv_db = 2*(a + b), 2*(a + b), 2*(a - b), -2*(a - b) 
da_dw, db_dw = 2*(w + x + y + z), 2*(w + x - y - z) 

# Compute the final result from inputs to outputs 
du_dw, dv_dw = du_da*da_dw + du_db*db_dw, dv_da*da_dw + dv_db*db_dw 
df_dw = df_du*du_dw + df_dv*dv_dw 
print(f'df/dw at {w}, {x}, {y}, {z} is {df_dw}') 


f at -1, 0, -2, 1 is 1024 
df/dw at -1, 0, -2, 1 is -4096 


However, note that this still does not make it easy to compute something like $\frac{\partial f}{\partial x}$. The 
reason for that is the way we chose to apply the chain rule. If we look at what we did above, 
we always kept $\partial w$ in the denominator when we could. In this way, we chose to apply the 
chain rule seeing how w changed every other variable. If that is what we wanted, this would 
be a good idea. However, think back to our motivation from deep learning: we want to see 
how every parameter changes the loss. In essence, we want to apply the chain rule keeping 
$\partial f$ in the numerator whenever we can! 


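To be more explicit, note that we can write 

$$
\begin{aligned}
\frac{\partial f}{\partial w} & = \frac{\partial f}{\partial a}\frac{\partial a}{\partial w} + \frac{\partial f}{\partial b}\frac{\partial b}{\partial w}, \\
\frac{\partial f}{\partial a} & = \frac{\partial f}{\partial u}\frac{\partial u}{\partial a} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial a}, \\
\frac{\partial f}{\partial b} & = \frac{\partial f}{\partial u}\frac{\partial u}{\partial b} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial b}.
\end{aligned}
\tag{A.23}
$$

Note that this application of the chain rule has us explicitly compute $\frac{\partial f}{\partial u}$, $\frac{\partial f}{\partial v}$, $\frac{\partial f}{\partial a}$, $\frac{\partial f}{\partial b}$, and $\frac{\partial f}{\partial w}$. 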
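Nothing stops us from also including the equations: 

$$
\begin{aligned}
\frac{\partial f}{\partial x} & = \frac{\partial f}{\partial a}\frac{\partial a}{\partial x} + \frac{\partial f}{\partial b}\frac{\partial b}{\partial x}, \\
\frac{\partial f}{\partial y} & = \frac{\partial f}{\partial a}\frac{\partial a}{\partial y} + \frac{\partial f}{\partial b}\frac{\partial b}{\partial y}, \\
\frac{\partial f}{\partial z} & = \frac{\partial f}{\partial a}\frac{\partial a}{\partial z} + \frac{\partial f}{\partial b}\frac{\partial b}{\partial z},
\end{aligned}
\tag{A.24}
$$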


and then keeping track of how f changes when we change any node in the entire network. 


Let’s implement it. 


# Compute the value of the function from inputs to outputs 
w, x, y, z = -1, 0, -2, 1 
a, b = (w + x + y + z)**2, (w + x - y - z)**2 
u, v = (a + b)**2, (a - b)**2 
f = (u + v)**2 
print(f'f at {w}, {x}, {y}, {z} is {f}') 

# Compute the derivative using the decomposition above 
# First compute the single step partials 
df_du, df_dv = 2*(u + v), 2*(u + v) 
du_da, du_db, dv_da, dv_db = 2*(a + b), 2*(a + b), 2*(a - b), -2*(a - b) 
da_dw, db_dw = 2*(w + x + y + z), 2*(w + x - y - z) 
da_dx, db_dx = 2*(w + x + y + z), 2*(w + x - y - z) 
da_dy, db_dy = 2*(w + x + y + z), -2*(w + x - y - z) 
da_dz, db_dz = 2*(w + x + y + z), -2*(w + x - y - z) 

# Now compute how f changes when we change any value from output to input 
df_da, df_db = df_du*du_da + df_dv*dv_da, df_du*du_db + df_dv*dv_db 
df_dw, df_dx = df_da*da_dw + df_db*db_dw, df_da*da_dx + df_db*db_dx 
df_dy, df_dz = df_da*da_dy + df_db*db_dy, df_da*da_dz + df_db*db_dz 

print(f'df/dw at {w}, {x}, {y}, {z} is {df_dw}') 
print(f'df/dx at {w}, {x}, {y}, {z} is {df_dx}') 
print(f'df/dy at {w}, {x}, {y}, {z} is {df_dy}') 
print(f'df/dz at {w}, {x}, {y}, {z} is {df_dz}') 


f at -1, 0, -2, 1 is 1024 
df/dw at -1, 0, -2, 1 is -4096 
df/dx at -1, 0, -2, 1 is -4096 
df/dy at -1, 0, -2, 1 is -4096 
df/dz at -1, 0, -2, 1 is -4096 
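
Hand-derived partials are easy to get wrong, so it can help to compare them against a numerical estimate. The following sketch approximates all four partial derivatives with central finite differences; the helper name `f_of` and the step size are illustrative assumptions. Note that this check needs two function evaluations per input, which is exactly the cost that the backward ordering avoids.

```python
def f_of(w, x, y, z):
    """The running example, written as a single function of the inputs."""
    a, b = (w + x + y + z) ** 2, (w + x - y - z) ** 2
    u, v = (a + b) ** 2, (a - b) ** 2
    return (u + v) ** 2

point = [-1.0, 0.0, -2.0, 1.0]
eps = 1e-5  # step size (an assumption; small but not so small that rounding dominates)

numeric_grad = []
for k in range(len(point)):
    plus, minus = list(point), list(point)
    plus[k] += eps
    minus[k] -= eps
    # Central difference in the k-th input only
    numeric_grad.append((f_of(*plus) - f_of(*minus)) / (2 * eps))

print(numeric_grad)
```

Each entry should be close to the value -4096 obtained from the chain rule.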


The fact that we compute derivatives from f back towards the inputs rather than from the 
inputs forward to the outputs (as we did in the first code snippet above) is what gives this 
algorithm its name: backpropagation. Note that there are two steps: 

1. Compute the value of the function, and the single step partials from front to back. 
While not done above, this can be combined into a single forward pass. 

2. Compute the gradient of f from back to front. We call this the backwards pass. 


This is precisely what every deep learning algorithm implements to allow the computation 
of the gradient of the loss with respect to every weight in the network in one pass. It is an 
astonishing fact that we have such a decomposition. 
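
To make the two passes concrete, here is a minimal sketch of reverse-mode differentiation for scalars. It is not how any particular framework is implemented; the `Node` class and the `backward` function are hypothetical names introduced for illustration, and only addition, subtraction, and squaring are supported because that is all the running example needs.

```python
class Node:
    """A scalar in a computation graph that remembers how it was produced."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs (parent_node, single step partial)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Node(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def square(self):
        return Node(self.value ** 2, ((self, 2 * self.value),))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node."""
    order, seen = [], set()

    def visit(node):
        # Post-order traversal: inputs are listed before the nodes using them
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk from the output back to the inputs, applying the chain rule
    for node in reversed(order):
        for parent, local_partial in node.parents:
            parent.grad += node.grad * local_partial


# Forward pass: values and single step partials are recorded as we go
w, x, y, z = Node(-1.0), Node(0.0), Node(-2.0), Node(1.0)
a = (w + x + y + z).square()
b = (w + x - y - z).square()
u, v = (a + b).square(), (a - b).square()
f = (u + v).square()

# Backward pass: one sweep yields the derivative with respect to every input
backward(f)
print(f.value, w.grad, x.grad, y.grad, z.grad)
```

The forward pass stores each single step partial next to the value it belongs to, and the backward pass visits every node once, so the cost of obtaining all four derivatives is comparable to evaluating f itself.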


To see how to encapsulate this, let’s take a quick look at this example. 


# Initialize as tensors, then attach gradients 
w = torch.tensor([-1.], requires_grad=True) 
x = torch.tensor([0.], requires_grad=True) 
y = torch.tensor([-2.], requires_grad=True) 
z = torch.tensor([1.], requires_grad=True) 

# Do the computation like usual, tracking gradients 
a, b = (w + x + y + z)**2, (w + x - y - z)**2 
u, v = (a + b)**2, (a - b)**2 
f = (u + v)**2 

# Execute backward pass 
f.backward() 

print(f'df/dw at {w.data.item()}, {x.data.item()}, {y.data.item()}, ' 
      f'{z.data.item()} is {w.grad.data.item()}') 
print(f'df/dx at {w.data.item()}, {x.data.item()}, {y.data.item()}, ' 
      f'{z.data.item()} is {x.grad.data.item()}') 
print(f'df/dy at {w.data.item()}, {x.data.item()}, {y.data.item()}, ' 
      f'{z.data.item()} is {y.grad.data.item()}') 
print(f'df/dz at {w.data.item()}, {x.data.item()}, {y.data.item()}, ' 
      f'{z.data.item()} is {z.grad.data.item()}') 


df/dw at -1.0, 0.0, -2.0, 1.0 is -4096.0 
df/dx at -1.0, 0.0, -2.0, 1.0 is -4096.0 
df/dy at -1.0, 0.0, -2.0, 1.0 is -4096.0 
df/dz at -1.0, 0.0, -2.0, 1.0 is -4096.0 


All of what we did above can be done automatically by calling f.backward(). 


A.4.6 Hessians 


As with single variable calculus, it is useful to consider higher-order derivatives in order 
to get a handle on how we can obtain a better approximation to a function than using the 
gradient alone. 


There is one immediate problem one encounters when working with higher order derivatives 
of functions of several variables, and that is there are a large number of them. If we 
have a function $f(x_1, \ldots, x_n)$ of n variables, then we can take $n^{2}$ many second derivatives, 
namely for any choice of i and j: 

$$\frac{d^2f}{dx_idx_j} = \frac{d}{dx_i}\left(\frac{d}{dx_j}f\right). \tag{A.25}$$


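This is traditionally assembled into a matrix called the Hessian: 

$$
\mathbf{H}_f = \begin{bmatrix}
\frac{d^2f}{dx_1dx_1} & \cdots & \frac{d^2f}{dx_1dx_n} \\
\vdots & \ddots & \vdots \\
\frac{d^2f}{dx_ndx_1} & \cdots & \frac{d^2f}{dx_ndx_n}
\end{bmatrix}.
\tag{A.26}
$$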


Not every entry of this matrix is independent. Indeed, we can show that as long as both 
mixed partials (partial derivatives with respect to more than one variable) exist and are 
continuous, we can say that for any i and j, 

$$\frac{d^2f}{dx_idx_j} = \frac{d^2f}{dx_jdx_i}. \tag{A.27}$$


This follows by considering first perturbing a function in the direction of $x_i$, and then 
perturbing it in $x_j$ and then comparing the result of that with what happens if we perturb first 
$x_j$ and then $x_i$, with the knowledge that both of these orders lead to the same final change 
in the output of f. 


As with single variables, we can use these derivatives to get a far better idea of how the 
function behaves near a point. In particular, we can use it to find the best fitting quadratic 
near a point $\mathbf{x}_0$, as we saw in a single variable. 


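Let’s see an example. Suppose that $f(x_1, x_2) = a + b_1x_1 + b_2x_2 + c_{11}x_1^{2} + c_{12}x_1x_2 + c_{22}x_2^{2}$. 
This is the general form for a quadratic in two variables. If we look at the value of the 
function, its gradient, and its Hessian (A.26), all at the point zero: 

$$
\begin{aligned}
f(0,0) & = a, \\
\nabla f (0,0) & = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}, \\
\mathbf{H} f (0,0) & = \begin{bmatrix} 2 c_{11} & c_{12} \\ c_{12} & 2c_{22} \end{bmatrix},
\end{aligned}
\tag{A.28}
$$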


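we can get our original polynomial back by saying 

$$f(\mathbf{x}) = f(0) + \nabla f (0) \cdot \mathbf{x} + \frac{1}{2}\mathbf{x}^\top \mathbf{H} f (0) \mathbf{x}. \tag{A.29}$$

In general, if we computed this expansion at any point $\mathbf{x}_0$, we see that 

$$f(\mathbf{x}) = f(\mathbf{x}_0) + \nabla f (\mathbf{x}_0) \cdot (\mathbf{x}-\mathbf{x}_0) + \frac{1}{2}(\mathbf{x}-\mathbf{x}_0)^\top \mathbf{H} f (\mathbf{x}_0) (\mathbf{x}-\mathbf{x}_0). \tag{A.30}$$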


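This works for any dimensional input, and provides the best approximating quadratic to any 
function at a point. To give an example, let’s plot the function 

$$f(x, y) = xe^{-x^2-y^2}. \tag{A.31}$$

One can compute that the gradient and Hessian are 

$$
\nabla f(x, y) = e^{-x^2-y^2}\begin{pmatrix}1-2x^2 \\ -2xy\end{pmatrix} \; \textrm{and} \; \mathbf{H}f(x, y) = e^{-x^2-y^2}\begin{pmatrix} 4x^3 - 6x & 4x^2y - 2y \\ 4x^2y-2y & 4xy^2-2x\end{pmatrix}.
\tag{A.32}
$$

And thus, with a little algebra, see that the approximating quadratic at $[-1,0]^\top$ is 

$$f(x, y) \approx e^{-1}\left(-1 - (x+1) + (x+1)^2 + y^2\right). \tag{A.33}$$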


# Construct grid and compute function 
x, y = torch.meshgrid(torch.linspace(-2, 2, 101), 
                      torch.linspace(-2, 2, 101)) 
z = x*torch.exp(- x**2 - y**2) 

# Compute approximating quadratic with gradient and Hessian at (-1, 0), see (A.33) 
w = torch.exp(torch.tensor([-1.]))*(-1 - (x + 1) + (x + 1)**2 + y**2) 

# Plot function 
ax = d2l.plt.figure().add_subplot(111, projection='3d') 
ax.plot_wireframe(x.numpy(), y.numpy(), z.numpy(), 
                  **{'rstride': 10, 'cstride': 10}) 
ax.plot_wireframe(x.numpy(), y.numpy(), w.numpy(), 
                  **{'rstride': 10, 'cstride': 10}, color='purple') 
d2l.plt.xlabel('x') 
d2l.plt.ylabel('y') 
d2l.set_figsize() 
ax.set_xlim(-2, 2) 
ax.set_ylim(-2, 2) 
ax.set_zlim(-1, 1) 
ax.dist = 12 
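
Besides plotting, we can check the quality of the approximation numerically. The sketch below, which uses only the standard library, compares f with the quadratic from (A.33) at two points near (-1, 0); the particular offsets are arbitrary choices for illustration. For a second-order approximation we would expect the error to shrink roughly like the cube of the distance to the expansion point.

```python
import math

def f(x, y):
    return x * math.exp(-x ** 2 - y ** 2)

def quadratic(x, y):
    # Approximating quadratic at (-1, 0), as given in (A.33)
    return math.exp(-1) * (-1 - (x + 1) + (x + 1) ** 2 + y ** 2)

errors = {}
for step in (0.1, 0.01):
    # Move away from (-1, 0) by `step` in both coordinates
    x, y = -1 + step, step
    errors[step] = abs(f(x, y) - quadratic(x, y))
    print(f'offset {step}: error {errors[step]:.2e}')
```

Shrinking the offset by a factor of 10 should reduce the error by a factor of roughly 1000.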


This forms the basis for Newton’s Algorithm discussed in Section 12.3, where we perform 
numerical optimization iteratively finding the best fitting quadratic, and then exactly mini- 
mizing that quadratic. 


A.4.7 A Little Matrix Calculus 


Derivatives of functions involving matrices turn out to be particularly nice. This section 
can become notationally heavy, so may be skipped in a first reading, but it is useful to know 
how derivatives of functions involving common matrix operations are often much cleaner 
than one might initially anticipate, particularly given how central matrix operations are to 
deep learning applications. 


Let’s begin with an example. Suppose that we have some fixed column vector $\boldsymbol{\beta}$, and we 
want to take the product function $f(\mathbf{x}) = \boldsymbol{\beta}^\top\mathbf{x}$, and understand how the dot product changes 
when we change $\mathbf{x}$. 


A bit of notation that will be useful when working with matrix derivatives in ML is called 
the denominator layout matrix derivative where we assemble our partial derivatives into 
the shape of whatever vector, matrix, or tensor is in the denominator of the differential. In 
this case, we will write 

$$
\frac{df}{d\mathbf{x}} = \begin{bmatrix}
\frac{df}{dx_1} \\
\vdots \\
\frac{df}{dx_n}
\end{bmatrix},
\tag{A.34}
$$

where we matched the shape of the column vector $\mathbf{x}$. 


If we write out our function into components this is 

$$f(\mathbf{x}) = \sum_{i = 1}^{n} \beta_ix_i = \beta_1x_1 + \cdots + \beta_nx_n. \tag{A.35}$$

If we now take the partial derivative with respect to say $x_1$, note that everything is zero but 
the first term, which is just $x_1$ multiplied by $\beta_1$, so we obtain that 

$$\frac{df}{dx_1} = \beta_1, \tag{A.36}$$

or more generally that 

$$\frac{df}{dx_i} = \beta_i. \tag{A.37}$$

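We can now reassemble this into a matrix to see 

$$
\frac{df}{d\mathbf{x}} = \begin{bmatrix}
\frac{df}{dx_1} \\
\vdots \\
\frac{df}{dx_n}
\end{bmatrix} = \begin{bmatrix}
\beta_1 \\
\vdots \\
\beta_n
\end{bmatrix} = \boldsymbol{\beta}.
\tag{A.38}
$$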

This illustrates a few factors about matrix calculus that we will often encounter throughout 
this section: 


• First, the computations will get rather involved. 


• Second, the final results are much cleaner than the intermediate process, and will always 
look similar to the single variable case. In this case, note that $\frac{d}{dx}(bx) = b$ and 
$\frac{d}{d\mathbf{x}} (\boldsymbol{\beta}^\top\mathbf{x}) = \boldsymbol{\beta}$ are both similar. 


• Third, transposes can often appear seemingly from nowhere. The core reason for this is 
the convention that we match the shape of the denominator, thus when we multiply 
matrices, we will need to take transposes to match back to the shape of the original 
term. 


To keep building intuition, let’s try a computation that is a little harder. Suppose that we 
have a column vector $\mathbf{x}$, and a square matrix $A$ and we want to compute 

$$\frac{d}{d\mathbf{x}}(\mathbf{x}^\top A \mathbf{x}). \tag{A.39}$$

To drive towards easier to manipulate notation, let’s consider this problem using Einstein 
notation. In this case we can write the function as 

$$\mathbf{x}^\top A \mathbf{x} = x_ia_{ij}x_j. \tag{A.40}$$


To compute our derivative, we need to understand for every k, what is the value of 

$$\frac{d}{dx_k}(\mathbf{x}^\top A \mathbf{x}) = \frac{d}{dx_k}x_ia_{ij}x_j. \tag{A.41}$$

By the product rule, this is 

$$\frac{d}{dx_k}x_ia_{ij}x_j = \frac{dx_i}{dx_k}a_{ij}x_j + x_ia_{ij}\frac{dx_j}{dx_k}. \tag{A.42}$$

For a term like $\frac{dx_i}{dx_k}$, it is not hard to see that this is one when i = k and zero otherwise. 
This means that every term where i and k are different vanishes from this sum, so the only 
terms that remain in that first sum are the ones where i = k. The same reasoning holds for 
the second term where we need j = k. This gives 

$$\frac{d}{dx_k}x_ia_{ij}x_j = a_{kj}x_j + x_ia_{ik}. \tag{A.43}$$


Now, the names of the indices in Einstein notation are arbitrary: the fact that i and j are 
different is immaterial to this computation at this point, so we can re-index so that they both 
use i to see that 

$$\frac{d}{dx_k}x_ia_{ij}x_j = a_{ki}x_i + x_ia_{ik} = (a_{ki} + a_{ik})x_i. \tag{A.44}$$


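Now, here is where we start to need some practice to go further. Let’s try and identify this 
outcome in terms of matrix operations. $a_{ki} + a_{ik}$ is the k, i-th component of $\mathbf{A} + \mathbf{A}^\top$. This 
gives 

$$\frac{d}{dx_k}x_ia_{ij}x_j = [\mathbf{A} + \mathbf{A}^\top]_{ki}x_i. \tag{A.45}$$

Similarly, this term is now the product of the matrix $\mathbf{A} + \mathbf{A}^\top$ by the vector $\mathbf{x}$, so we see 
that 

$$\left[\frac{d}{d\mathbf{x}}(\mathbf{x}^\top A \mathbf{x})\right]_k = \frac{d}{dx_k}x_ia_{ij}x_j = [(\mathbf{A} + \mathbf{A}^\top)\mathbf{x}]_k. \tag{A.46}$$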


Thus, we see that the k-th entry of the desired derivative from (A.39) is just the k-th entry 
of the vector on the right, and thus the two are the same. This yields 

$$\frac{d}{d\mathbf{x}}(\mathbf{x}^\top A \mathbf{x}) = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}. \tag{A.47}$$

This required significantly more work than our last one, but the final result is small. More 
than that, consider the following computation for traditional single variable derivatives: 

$$\frac{d}{dx}(xax) = \frac{dx}{dx}ax + xa\frac{dx}{dx} = (a+a)x. \tag{A.48}$$

Equivalently $\frac{d}{dx}(ax^2) = 2ax = (a+a)x$. Again, we get a result that looks rather like the 
single variable result but with a transpose tossed in. 
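
A derived identity like (A.47) is easy to test numerically. The sketch below compares the closed form with central finite differences of the quadratic form, using plain Python lists; the matrix, the vector, and the helper name `quad_form` are made up for illustration. The matrix is deliberately not symmetric, so that the transpose actually matters.

```python
def quad_form(A, x):
    """Compute x^T A x as the double sum x_i a_ij x_j."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

# Arbitrary non-symmetric matrix and vector (assumptions for illustration)
A = [[1.0, 2.0, 0.0],
     [-1.0, 3.0, 4.0],
     [0.5, 0.0, -2.0]]
x = [0.3, -1.2, 2.0]
n = len(x)

# Closed form (A.47): the k-th entry is sum_i (a_ki + a_ik) x_i
closed_form = [sum((A[k][i] + A[i][k]) * x[i] for i in range(n))
               for k in range(n)]

# Central finite differences, perturbing one coordinate at a time
eps = 1e-6
numeric = []
for k in range(n):
    plus = [xi + (eps if i == k else 0.0) for i, xi in enumerate(x)]
    minus = [xi - (eps if i == k else 0.0) for i, xi in enumerate(x)]
    numeric.append((quad_form(A, plus) - quad_form(A, minus)) / (2 * eps))

print(closed_form)
print(numeric)
```

The two printed vectors should agree to several decimal places.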


At this point, the pattern should be looking rather suspicious, so let’s try to figure out 
why. When we take matrix derivatives like this, let’s first assume that the expression we 
get will be another matrix expression: an expression we can write in terms of products 
and sums of matrices and their transposes. If such an expression exists, it will need to be 
true for all matrices. In particular, it will need to be true of 1 x 1 matrices, in which case 
the matrix product is just the product of the numbers, the matrix sum is just the sum, and 
the transpose does nothing at all! In other words, whatever expression we get must match 
the single variable expression. This means that, with some practice, one can often guess 
matrix derivatives just by knowing what the associated single variable expression must look 
like! 


Let’s try this out. Suppose that $\mathbf{X}$ is an $n \times m$ matrix, $\mathbf{U}$ is an $n \times r$ and $\mathbf{V}$ is an $r \times m$. Let’s 
try to compute 

$$\frac{d}{d\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = \;? \tag{A.49}$$

This computation is important in an area called matrix factorization. For us, however, it is 
just a derivative to compute. Let’s try to imagine what this would be for $1 \times 1$ matrices. In 
that case, we get the expression 

$$\frac{d}{dv} (x-uv)^{2} = -2(x-uv)u, \tag{A.50}$$

where the derivative is rather standard. If we try to convert this back into a matrix expression 
we get 

$$\frac{d}{d\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = -2(\mathbf{X} - \mathbf{U}\mathbf{V})\mathbf{U}. \tag{A.51}$$


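However, if we look at this it does not quite work. Recall that $\mathbf{X}$ is $n \times m$, as is $\mathbf{U}\mathbf{V}$, so the 
matrix $2(\mathbf{X} - \mathbf{U}\mathbf{V})$ is $n \times m$. On the other hand $\mathbf{U}$ is $n \times r$, and we cannot multiply an $n \times m$ 
and an $n \times r$ matrix since the dimensions do not match! 


We want to get $\frac{d}{d\mathbf{V}}$, which is the same shape as $\mathbf{V}$, which is $r \times m$. So somehow we need 
to take an $n \times m$ matrix and an $n \times r$ matrix, multiply them together (perhaps with some 
transposes) to get an $r \times m$ matrix. We can do this by multiplying $\mathbf{U}^\top$ by $(\mathbf{X} - \mathbf{U}\mathbf{V})$. Thus, we can 
guess the solution to (A.49) is 

$$\frac{d}{d\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = -2\mathbf{U}^\top(\mathbf{X} - \mathbf{U}\mathbf{V}). \tag{A.52}$$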


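To show that this works, we would be remiss to not provide a detailed computation. If 
we already believe that this rule-of-thumb works, feel free to skip past this derivation. To 
compute 

$$\frac{d}{d\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^2, \tag{A.53}$$

we must find for every a and b 

$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = \frac{d}{dv_{ab}} \sum_{i, j}\left(x_{ij} - \sum_k u_{ik}v_{kj}\right)^2. \tag{A.54}$$

Recalling that all entries of $\mathbf{X}$ and $\mathbf{U}$ are constants as far as $\frac{d}{dv_{ab}}$ is concerned, we may 
push the derivative inside the sum, and apply the chain rule to the square to get 

$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = \sum_{i, j}2\left(x_{ij} - \sum_k u_{ik}v_{kj}\right)\left(-\sum_k u_{ik}\frac{dv_{kj}}{dv_{ab}} \right). \tag{A.55}$$

As in the previous derivation, we may note that $\frac{dv_{kj}}{dv_{ab}}$ is only non-zero if k = a and 
j = b. If either of those conditions does not hold, the term in the sum is zero, and we may 
freely discard it. We see that 

$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = -2\sum_{i}\left(x_{ib} - \sum_k u_{ik}v_{kb}\right)u_{ia}. \tag{A.56}$$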

An important subtlety here is that the requirement that k = a does not occur inside the 
inner sum since that k is a dummy variable which we are summing over inside the inner 
term. For a notationally cleaner example, consider why 

$$\frac{d}{dx_1} \left(\sum_i x_i \right)^{2} = 2\left(\sum_i x_i \right). \tag{A.57}$$

From this point, we may start identifying components of the sum. First, 

$$\sum_k u_{ik}v_{kb} = [\mathbf{U}\mathbf{V}]_{ib}. \tag{A.58}$$

So the entire expression in the inside of the sum is 

$$x_{ib} - \sum_k u_{ik}v_{kb} = [\mathbf{X}-\mathbf{U}\mathbf{V}]_{ib}. \tag{A.59}$$

This means we may now write our derivative as 

$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = -2\sum_{i}[\mathbf{X}-\mathbf{U}\mathbf{V}]_{ib}u_{ia}. \tag{A.60}$$

We want this to look like the a, b element of a matrix so we can use the technique as in the 
previous example to arrive at a matrix expression, which means that we need to exchange 
the order of the indices on $u_{ia}$. If we notice that $u_{ia} = [\mathbf{U}^\top]_{ai}$, we can then write 

$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = -2\sum_{i} [\mathbf{U}^\top]_{ai}[\mathbf{X}-\mathbf{U}\mathbf{V}]_{ib}. \tag{A.61}$$

This is a matrix product, and thus we can conclude that 

$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = -2[\mathbf{U}^\top(\mathbf{X}-\mathbf{U}\mathbf{V})]_{ab}, \tag{A.62}$$

and thus we may write the solution to (A.49) 

$$\frac{d}{d\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = -2\mathbf{U}^\top(\mathbf{X} - \mathbf{U}\mathbf{V}). \tag{A.63}$$


This matches the solution we guessed above! 
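
For readers who would rather test the rule of thumb than follow the index manipulation, here is a small numerical check written with nested lists. The matrix entries and the helper names `matmul` and `loss` are hypothetical, chosen only to keep the example self-contained; the shapes follow the text with n = 3, m = 2, and r = 2.

```python
def matmul(P, Q):
    """Product of two matrices stored as nested lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def loss(X, U, V):
    """Sum of squared entries of X - UV."""
    UV = matmul(U, V)
    return sum((X[i][j] - UV[i][j]) ** 2
               for i in range(len(X)) for j in range(len(X[0])))

# Made-up matrices: X is 3 x 2, U is 3 x 2, V is 2 x 2
X = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
U = [[0.5, -1.0], [2.0, 0.0], [1.0, 1.0]]
V = [[1.0, 0.0], [-0.5, 2.0]]

# Closed form: -2 U^T (X - UV)
UV = matmul(U, V)
residual = [[X[i][j] - UV[i][j] for j in range(2)] for i in range(3)]
U_transpose = [[U[i][a] for i in range(3)] for a in range(2)]
closed_form = [[-2 * g for g in row] for row in matmul(U_transpose, residual)]

# Central finite differences, one entry v_ab at a time
eps = 1e-6
numeric = [[0.0, 0.0], [0.0, 0.0]]
for a in range(2):
    for b in range(2):
        V_plus = [row[:] for row in V]
        V_minus = [row[:] for row in V]
        V_plus[a][b] += eps
        V_minus[a][b] -= eps
        numeric[a][b] = (loss(X, U, V_plus) - loss(X, U, V_minus)) / (2 * eps)

print(closed_form)
print(numeric)
```

Both results have the shape of V, and their entries should match closely. Swapping in the first guess, -2(X - UV)U, fails immediately because the shapes do not permit the product.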


It is reasonable to ask at this point, “Why can I not just write down matrix versions of 
all the calculus rules I have learned? It is clear this is still mechanical. Why do we not 
just get it over with!” And indeed there are such rules, and Petersen and Pedersen (2008) 
provide an excellent summary. However, due to the plethora of ways matrix operations 
can be combined compared to single values, there are many more matrix derivative rules 
than single variable ones. It is often the case that it is best to work with the indices, or leave 
it up to automatic differentiation when appropriate. 


A.4.8 Summary 


• In higher dimensions, we can define gradients which serve the same purpose as derivatives 
in one dimension. These allow us to see how a multi-variable function changes 
when we make an arbitrary small change to the inputs. 


• The backpropagation algorithm can be seen to be a method of organizing the multi-variable 
chain rule to allow for the efficient computation of many partial derivatives. 


• Matrix calculus allows us to write the derivatives of matrix expressions in concise ways. 


A.4.9 Exercises 


1. Given a column vector $\boldsymbol{\beta}$, compute the derivatives of both $f(\mathbf{x}) = \boldsymbol{\beta}^\top\mathbf{x}$ and $g(\mathbf{x}) = \mathbf{x}^\top\boldsymbol{\beta}$. 
Why do you get the same answer? 


2. Let $\mathbf{v}$ be an n dimension vector. What is $\frac{\partial}{\partial\mathbf{v}}\|\mathbf{v}\|_2$? 


3. Let $L(x, y) = \log(e^x + e^y)$. Compute the gradient. What is the sum of the components 
of the gradient? 


4. Let $f(x, y) = x^2y + xy^2$. Show that the only critical point is (0,0). By considering 
f(x, x), determine if (0,0) is a maximum, minimum, or neither. 


5. Suppose that we are minimizing a function $f(\mathbf{x}) = g(\mathbf{x}) + h(\mathbf{x})$. How can we geometrically 
interpret the condition of $\nabla f = 0$ in terms of g and h? 




A.5 Integral Calculus 


Differentiation only makes up half of the content of a traditional calculus education. The 
other pillar, integration, starts out seeming a rather disjoint question, “What is the area 
underneath this curve?” While seemingly unrelated, integration is tightly intertwined with 
differentiation via what is known as the fundamental theorem of calculus. 


At the level of machine learning we discuss in this book, we will not need a deep understand- 
ing of integration. However, we will provide a brief introduction to lay the groundwork for 
any further applications we will encounter later on. 


A.5.1 Geometric Interpretation 


Suppose that we have a function f(x). For simplicity, let’s assume that f (x) is non-negative 
(never takes a value less than zero). What we want to try and understand is: what is the 
area contained between f(x) and the x-axis? 


%matplotlib inline 
import torch 
from IPython import display 
from mpl_toolkits import mplot3d 
from d2l import torch as d2l 

x = torch.arange(-2, 2, 0.01) 
f = torch.exp(-x**2) 

d2l.set_figsize() 
d2l.plt.plot(x, f, color='black') 
d2l.plt.fill_between(x.tolist(), f.tolist()) 
d2l.plt.show() 


In most cases, this area will be infinite or undefined (consider the area under $f(x) = x^{2}$), 
so people will often talk about the area between a pair of ends, say a and b. 


x = torch.arange(-2, 2, 0.01) 
f = torch.exp(-x**2) 

d2l.set_figsize() 
d2l.plt.plot(x, f, color='black') 
d2l.plt.fill_between(x.tolist()[50:250], f.tolist()[50:250]) 
d2l.plt.show() 


We will denote this area by the integral symbol below: 

$$\mathrm{Area}(\mathcal{A}) = \int_a^b f(x) \;dx. \tag{A.1}$$

The inner variable is a dummy variable, much like the index of a sum in a $\sum$, and so this 
can be equivalently written with any inner value we like: 

$$\int_a^b f(x) \;dx = \int_a^b f(z) \;dz. \tag{A.2}$$

There is a traditional way to try and understand how we might try to approximate such 
integrals: we can imagine taking the region in-between a and b and chopping it into N 
vertical slices. If N is large, we can approximate the area of each slice by a rectangle, and 
then add up the areas to get the total area under the curve. Let’s take a look at an example 
doing this in code. We will see how to get the true value in a later section. 


epsilon = 0.05 
a = 0 
b = 2 

x = torch.arange(a, b, epsilon) 
f = x / (1 + x**2) 

approx = torch.sum(epsilon*f) 
true = torch.log(torch.tensor([5.])) / 2 

d2l.set_figsize() 
d2l.plt.bar(x, f, width=epsilon, align='edge') 
d2l.plt.plot(x, f, color='black') 
d2l.plt.ylim([0, 1]) 
d2l.plt.show() 

f'approximation: {approx}, truth: {true}' 


'approximation: 0.7944855690002441, truth: tensor([0.8047])' 


The issue is that while it can be done numerically, we can do this approach analytically for 
only the simplest functions like 

$$\int_a^b x \;dx. \tag{A.3}$$

Anything somewhat more complex like our example from the code above 

$$\int_a^b \frac{x}{1+x^{2}} \;dx \tag{A.4}$$

is beyond what we can solve with such a direct method. 


We will instead take a different approach. We will work intuitively with the notion of the 
area, and learn the main computational tool used to find integrals: the fundamental theorem 
of calculus. This will be the basis for our study of integration. 
A.5.2 The Fundamental Theorem of Calculus 


To dive deeper into the theory of integration, let’s introduce a function 

$$F(x) = \int_0^x f(y) \;dy. \tag{A.5}$$

This function measures the area between 0 and x depending on how we change x. Notice 
that this is everything we need since 

$$\int_a^b f(x) \;dx = F(b) - F(a). \tag{A.6}$$

This is a mathematical encoding of the fact that we can measure the area out to the far 
end-point and then subtract off the area to the near end point as indicated in Fig. A.1. 




Visualizing why we may reduce the problem of computing the area under a curve between 
two points to computing the area to the left of a point. 


Thus, we can figure out what the integral over any interval is by figuring out what F(x) 
is. 


To do so, let’s consider an experiment. As we often do in calculus, let’s imagine what 
happens when we shift the value by a tiny bit. From the comment above, we know that 

$$F(x+\epsilon) - F(x) = \int_x^{x+\epsilon} f(y) \; dy. \tag{A.7}$$

This tells us that the function changes by the area under a tiny sliver of a function. 


This is the point at which we make an approximation. If we look at a tiny sliver of area like 
this, it looks like this area is close to the rectangular area with height the value of f(x) and 
the base width $\epsilon$. Indeed, one can show that as $\epsilon \rightarrow 0$ this approximation becomes better 
and better. Thus we can conclude: 

$$F(x+\epsilon) - F(x) \approx \epsilon f(x). \tag{A.8}$$

However, we can now notice: this is exactly the pattern we expect if we were computing 
the derivative of F! Thus we see the following rather surprising fact: 

$$\frac{dF}{dx}(x) = f(x). \tag{A.9}$$

This is the fundamental theorem of calculus. We may write it in expanded form as 

$$\frac{d}{dx}\int_0^x f(y) \; dy = f(x). \tag{A.10}$$


It takes the concept of finding areas (a priori rather hard), and reduces it to a statement 
about derivatives (something much more completely understood). One last comment that we 
must make is that this does not tell us exactly what F(x) is. Indeed F(x) + C for any C has 
the same derivative. This is a fact-of-life in the theory of integration. Thankfully, notice 
that when working with definite integrals, the constants drop out, and thus are irrelevant to 
the outcome. 

$$\int_a^b f(x) \; dx = (F(b) + C) - (F(a) + C) = F(b) - F(a). \tag{A.11}$$


This may seem like abstract nonsense, but let’s take a moment to appreciate that it has given 
us a whole new perspective on computing integrals. Our goal is no longer to do some sort 
of chop-and-sum process to try and recover the area, rather we need only find a function 
whose derivative is the function we have! This is incredible since we can now list many 
rather difficult integrals by just reversing the table from Section A.3.2. For instance, we 
know that the derivative of $x^{n}$ is $nx^{n-1}$. Thus, we can say using the fundamental theorem 
(A.10) that 

$$\int_0^{x} ny^{n-1} \; dy = x^n - 0^n = x^n. \tag{A.12}$$

Similarly, we know that the derivative of $e^{x}$ is itself, so that means 

$$\int_0^{x} e^{y} \; dy = e^{x} - e^{0} = e^x - 1. \tag{A.13}$$

In this way, we can develop the entire theory of integration leveraging ideas from differential 
calculus freely. Every integration rule derives from this one fact. 
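
The two views of the integral, chop-and-sum versus antiderivative, can be compared directly. The sketch below approximates both integrals above with a midpoint Riemann sum and compares the result with the closed forms; the helper name `midpoint_sum`, the upper limit, the exponent, and the number of slices are all illustrative choices.

```python
import math

def midpoint_sum(g, a, b, n=10_000):
    """Approximate the integral of g over [a, b] with n midpoint rectangles."""
    width = (b - a) / n
    return sum(g(a + (i + 0.5) * width) for i in range(n)) * width

x = 1.7   # upper limit (arbitrary)
n = 3     # exponent in the power rule example (arbitrary)

power_area = midpoint_sum(lambda y: n * y ** (n - 1), 0.0, x)
exp_area = midpoint_sum(math.exp, 0.0, x)

print(power_area, x ** n)           # chop-and-sum versus x^n
print(exp_area, math.exp(x) - 1)    # chop-and-sum versus e^x - 1
```

The sums need thousands of function evaluations to reach a few digits of accuracy, whereas the antiderivative gives the exact value with a single evaluation at each end point.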


A.5.3 Change of Variables 


Just as with differentiation, there are a number of rules which make the computation of 
integrals more tractable. In fact, every rule of differential calculus (like the product rule, 
sum rule, and chain rule) has a corresponding rule for integral calculus (integration by parts, 
linearity of integration, and the change of variables formula respectively). In this section, 
we will dive into what is arguably the most important from the list: the change of variables 
formula. 


First, suppose that we have a function which is itself an integral: 

$$F(x) = \int_0^x f(y) \; dy. \tag{A.14}$$

Let’s suppose that we want to know how this function looks when we compose it with 
another to obtain F(u(x)). By the chain rule, we know 

$$\frac{d}{dx}F(u(x)) = \frac{dF}{du}(u(x))\cdot \frac{du}{dx}. \tag{A.15}$$

We can turn this into a statement about integration by using the fundamental theorem 
(A.10) as above. This gives 

$$F(u(x)) - F(u(0)) = \int_0^x \frac{dF}{du}(u(y))\cdot \frac{du}{dy} \;dy. \tag{A.16}$$

Recalling that F is itself an integral gives that the left hand side may be rewritten to 
be 

$$\int_{u(0)}^{u(x)} f(y) \; dy = \int_0^x \frac{dF}{du}(u(y))\cdot \frac{du}{dy} \;dy. \tag{A.17}$$

Similarly, recalling that F is an integral allows us to recognize that $\frac{dF}{dx} = f$ using the 
fundamental theorem (A.10), and thus we may conclude 

$$\int_{u(0)}^{u(x)} f(y) \; dy = \int_0^x f(u(y))\cdot \frac{du}{dy} \;dy. \tag{A.18}$$


This is the change of variables formula. 


For a more intuitive derivation, consider what happens when we take an integral of f(u(x)) 
between x and $x+\epsilon$. For a small $\epsilon$, this integral is approximately $\epsilon f(u(x))$, the area of 
the associated rectangle. Now, let’s compare this with the integral of f(y) from u(x) to 
$u(x+\epsilon)$. We know that $u(x+\epsilon) \approx u(x) + \epsilon \frac{du}{dx}(x)$, so the area of this rectangle is 
approximately $\epsilon \frac{du}{dx}(x)f(u(x))$. Thus, to make the area of these two rectangles agree, we need 
to multiply the first one by $\frac{du}{dx}(x)$ as is illustrated in Fig. A.2. 


Visualizing the transformation of a single thin rectangle under the change of variables. 


This tells us that 

$$\int_x^{x+\epsilon} f(u(y))\frac{du}{dy}(y)\;dy = \int_{u(x)}^{u(x+\epsilon)} f(y) \; dy. \tag{A.19}$$

This is the change of variables formula expressed for a single small rectangle. 


If u(x) and f(x) are properly chosen, this can allow for the computation of incredibly 
complex integrals. For instance, if we even chose f(y) = 1 and $u(x) = e^{-x^{2}}$ (which means 
$\frac{du}{dx}(x) = -2xe^{-x^{2}}$), this can show for instance that 

$$e^{-1} - 1 = \int_{e^{-0}}^{e^{-1}} 1 \; dy = -2\int_0^{1} ye^{-y^2}\;dy, \tag{A.20}$$

and thus by rearranging that 

$$\int_0^{1} ye^{-y^2}\; dy = \frac{1-e^{-1}}{2}. \tag{A.21}$$
f 2 


A.5.4 A Comment on Sign Conventions 
Keen-eyed readers will observe something strange about the computations above. Namely, 
computations like 

$$\int_{e^{-0}}^{e^{-1}} 1 \; dy = e^{-1} - 1 < 0, \tag{A.22}$$

can produce negative numbers. When thinking about areas, it can be strange to see a 
negative value, and so it is worth digging into what the convention is. 


Mathematicians work with the notion of signed areas. This manifests itself in two ways. First, if 
we consider a function f(x) which is sometimes less than zero, then the area will also be 
negative. So for instance 

$$\int_0^{1} (-1)\;dx = -1. \tag{A.23}$$

Similarly, integrals which progress from right to left, rather than left to right are also taken 
to be negative areas 

$$\int_0^{-1} 1\; dx = -1. \tag{A.24}$$

The standard area (from left to right of a positive function) is always positive. Anything 
obtained by flipping it (say flipping over the x-axis to get the integral of a negative number, 
or flipping over the y-axis to get an integral in the wrong order) will produce a negative 
area. And indeed, flipping twice will give a pair of negative signs that cancel out to have 
positive area 

$$\int_0^{-1} (-1)\;dx = 1. \tag{A.25}$$


If this discussion sounds familiar, it is! In Section A.1 we discussed how the determinant 
represented the signed area in much the same way. 
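
The sign convention falls out naturally from the chop-and-sum picture if we allow the slice width to be negative when the limits are reversed. The sketch below illustrates this for the three integrals above; `midpoint_sum` is a hypothetical helper, not a library function.

```python
def midpoint_sum(g, a, b, n=1000):
    """Midpoint Riemann sum of g from a to b."""
    width = (b - a) / n  # negative when integrating from right to left
    return sum(g(a + (i + 0.5) * width) for i in range(n)) * width

negative_function = midpoint_sum(lambda x: -1.0, 0.0, 1.0)   # as in (A.23)
reversed_limits = midpoint_sum(lambda x: 1.0, 0.0, -1.0)     # as in (A.24)
both_flipped = midpoint_sum(lambda x: -1.0, 0.0, -1.0)       # as in (A.25)

print(negative_function, reversed_limits, both_flipped)
```

Each flip contributes one factor of -1: either through the function values or through the slice width.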


A.5.5 Multiple Integrals 


In some cases, we will need to work in higher dimensions. For instance, suppose that we 
have a function of two variables, like f(x, y) and we want to know the volume under f 
when x ranges over [a, b] and y ranges over [c, d]. 


# Construct grid and compute function 
x, y = torch.meshgrid(torch.linspace(-2, 2, 101), torch.linspace(-2, 2, 101)) 
z = torch.exp(- x**2 - y**2) 

# Plot function 
ax = d2l.plt.figure().add_subplot(111, projection='3d') 
ax.plot_wireframe(x, y, z) 
d2l.plt.xlabel('x') 
d2l.plt.ylabel('y') 
d2l.plt.xticks([-2, -1, 0, 1, 2]) 
d2l.plt.yticks([-2, -1, 0, 1, 2]) 
d2l.set_figsize() 
ax.set_xlim(-2, 2) 
ax.set_ylim(-2, 2) 
ax.set_zlim(0, 1) 
ax.dist = 12 


We write this as 

$$\int_{[a, b]\times[c, d]} f(x, y)\;dx\;dy. \tag{A.26}$$

Suppose that we wish to compute this integral. Our claim is that we can do this by iteratively 
computing first the integral in x and then shifting to the integral in y, that is to say 

$$\int_{[a, b]\times[c, d]} f(x, y)\;dx\;dy = \int_c^{d} \left(\int_a^{b} f(x, y) \;dx\right) \; dy. \tag{A.27}$$


Let’s see why this is. 


Consider the figure above where we have split the function into $\epsilon \times \epsilon$ squares, which we will
index with integer coordinates $i, j$. In this case, our integral is approximately

$$\sum_{i, j} \epsilon^2 f(\epsilon i, \epsilon j). \tag{A.28}$$

Once we discretize the problem, we may add up the values on these squares in whatever
order we like, and not worry about changing the values. This is illustrated in Fig. A.3. In
particular, we can say that


$$\sum_{j} \epsilon \left(\sum_{i} \epsilon f(\epsilon i, \epsilon j)\right). \tag{A.29}$$

Fig. A.3: Illustrating how to decompose a sum over many squares as a sum over first the columns
(1), then adding the column sums together (2).

The sum on the inside is precisely the discretization of the integral

$$G(\epsilon j) = \int_a^b f(x, \epsilon j) \, dx. \tag{A.30}$$


Finally, notice that if we combine these two expressions we get 


$$\sum_{j} \epsilon G(\epsilon j) \approx \int_c^d G(y) \, dy = \int_{[a, b] \times [c, d]} f(x, y) \, dx \, dy. \tag{A.31}$$


Thus putting it all together, we have that 


$$\int_{[a, b] \times [c, d]} f(x, y) \, dx \, dy = \int_c^d \left(\int_a^b f(x, y) \, dx\right) dy. \tag{A.32}$$


Notice that, once discretized, all we did was rearrange the order in which we added a list
of numbers. This may make it seem like nothing, but this result (called Fubini’s
Theorem) is not always true! For the type of mathematics encountered when doing machine
learning (continuous functions), there is no concern; however, it is possible to create
examples where it fails (for example the function $f(x, y) = xy(x^2 - y^2)/(x^2 + y^2)^3$ over the
rectangle $[0, 2] \times [0, 1]$).


Note that the choice to do the integral in x first, and then the integral in y was arbitrary. We 
could have equally well chosen to do y first and then x to see 


$$\int_{[a, b] \times [c, d]} f(x, y) \, dx \, dy = \int_a^b \left(\int_c^d f(x, y) \, dy\right) dx. \tag{A.33}$$


Oftentimes, we will condense down to vector notation, and say that for $U = [a, b] \times [c, d]$
this is

$$\int_U f(\mathbf{x}) \, d\mathbf{x}. \tag{A.34}$$
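As a rough numerical check of the rearrangement argument, the sketch below approximates the volume under $f(x, y) = e^{-x^2 - y^2}$ over $[-2, 2] \times [-2, 2]$ with a midpoint Riemann sum, once summing over $x$ first and once over $y$ first. The helper name `iterated_sum` and the grid size `n` are illustrative choices, not part of the text above, and plain Python is used so that the snippet runs on its own.

```python
import math

def f(x, y):
    return math.exp(-x**2 - y**2)

def iterated_sum(f, a, b, c, d, n=200, x_first=True):
    """Midpoint Riemann sum of f over [a, b] x [c, d] using n x n cells."""
    hx, hy = (b - a) / n, (d - c) / n
    xs = [a + (i + 0.5) * hx for i in range(n)]
    ys = [c + (j + 0.5) * hy for j in range(n)]
    if x_first:
        # Inner sum over x (a discretized G(y)), outer sum over y
        return sum(sum(f(x, y) * hx for x in xs) * hy for y in ys)
    # Inner sum over y, outer sum over x
    return sum(sum(f(x, y) * hy for y in ys) * hx for x in xs)

vol_x_first = iterated_sum(f, -2, 2, -2, 2, x_first=True)
vol_y_first = iterated_sum(f, -2, 2, -2, 2, x_first=False)
print(vol_x_first, vol_y_first)  # the two orders agree up to rounding
```

For a continuous integrand like this one, both orders give the same number, which is what Fubini’s Theorem promises.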


A.5.6 Change of Variables in Multiple Integrals 


As with single variables in (A.18), the ability to change variables inside a higher dimensional
integral is a key tool. Let’s summarize the result without derivation.

We need a function that reparametrizes our domain of integration. We can take this to be
$\phi : \mathbb{R}^n \rightarrow \mathbb{R}^n$, that is any function which takes in $n$ real variables and returns another $n$. To
keep the expressions clean, we will assume that $\phi$ is injective, which is to say it never folds
over itself ($\phi(\mathbf{x}) = \phi(\mathbf{y}) \implies \mathbf{x} = \mathbf{y}$).


In this case, we can say that

$$\int_{\phi(U)} f(\mathbf{x}) \, d\mathbf{x} = \int_{U} f(\phi(\mathbf{x})) \left|\det(D\phi(\mathbf{x}))\right| \, d\mathbf{x}, \tag{A.35}$$

where $D\phi$ is the Jacobian of $\phi$, which is the matrix of partial derivatives of
$\boldsymbol{\phi} = (\phi_1(x_1, \ldots, x_n), \ldots, \phi_n(x_1, \ldots, x_n))$,

$$D\boldsymbol{\phi} = \begin{bmatrix} \frac{\partial \phi_1}{\partial x_1} & \cdots & \frac{\partial \phi_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial \phi_n}{\partial x_1} & \cdots & \frac{\partial \phi_n}{\partial x_n} \end{bmatrix}. \tag{A.36}$$

Looking closely, we see that this is similar to the single variable chain rule (A.18), except
we have replaced the term $\frac{du}{dx}(x)$ with $|\det(D\phi(\mathbf{x}))|$. Let’s see how we can interpret
this term. Recall that the $\frac{du}{dx}(x)$ term existed to say how much we stretched our $x$-axis by
applying $u$. The same process in higher dimensions is to determine how much we stretch
the area (or volume, or hyper-volume) of a little square (or little hyper-cube) by applying $\phi$.
If $\phi$ was the multiplication by a matrix, then we already know how the determinant gives
the answer.


With some work, one can show that the Jacobian provides the best approximation to a 
multivariable function ¢ at a point by a matrix in the same way we could approximate by 
lines or planes with derivatives and gradients. Thus the determinant of the Jacobian exactly 
mirrors the scaling factor we identified in one dimension. 


It takes some work to fill in the details to this, so do not worry if they are not clear now. 
Let’s see at least one example we will make use of later on. Consider the integral 


$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-x^2 - y^2} \, dx \, dy. \tag{A.37}$$


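Playing with this integral directly will get us nowhere, but if we change variables, we can
make significant progress. If we let $\phi(r, \theta) = (r\cos(\theta), r\sin(\theta))$ (which is to say that
$x = r\cos(\theta)$, $y = r\sin(\theta)$), then we can apply the change of variable formula to see that
this is the same thing as

$$\int_0^\infty \int_0^{2\pi} e^{-r^2} \left|\det(D\phi(\mathbf{x}))\right| \, d\theta \, dr, \tag{A.38}$$

where

$$\left|\det(D\phi(\mathbf{x}))\right| = \left|\det \begin{bmatrix} \cos(\theta) & -r\sin(\theta) \\ \sin(\theta) & r\cos(\theta) \end{bmatrix}\right| = r(\cos^2(\theta) + \sin^2(\theta)) = r. \tag{A.39}$$

Thus, the integral is

$$\int_0^\infty \int_0^{2\pi} r e^{-r^2} \, d\theta \, dr = 2\pi \int_0^\infty r e^{-r^2} \, dr = \pi, \tag{A.40}$$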


where the final equality follows by the same computation that we used in Section
A.5.3.


We will meet this integral again when we study continuous random variables in Section 
A.6. 
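The polar-coordinate computation can be sanity-checked numerically. The sketch below estimates the Jacobian determinant of $\phi(r, \theta) = (r\cos(\theta), r\sin(\theta))$ by central finite differences and evaluates the radial integral with a midpoint sum. The function names, the test point $(1.7, 0.4)$, the cutoff `R` and the step sizes are all assumptions made for illustration.

```python
import math

def phi(r, theta):
    """Polar-to-Cartesian map phi(r, theta) = (r cos(theta), r sin(theta))."""
    return r * math.cos(theta), r * math.sin(theta)

def jacobian_det(r, theta, h=1e-6):
    """Central finite-difference estimate of det(D phi) at (r, theta)."""
    d_r = [(p - m) / (2 * h) for p, m in zip(phi(r + h, theta), phi(r - h, theta))]
    d_t = [(p - m) / (2 * h) for p, m in zip(phi(r, theta + h), phi(r, theta - h))]
    # d_r and d_t are the two columns of the Jacobian matrix
    return d_r[0] * d_t[1] - d_t[0] * d_r[1]

# Midpoint sum of r * exp(-r^2) over [0, R]; the theta integral contributes 2 * pi
R, n = 8.0, 4000
h = R / n
radial = sum((i + 0.5) * h * math.exp(-((i + 0.5) * h) ** 2) * h for i in range(n))
polar_value = 2 * math.pi * radial

det_estimate = jacobian_det(1.7, 0.4)
print(det_estimate)  # close to r = 1.7
print(polar_value)   # close to pi
```

Truncating the radial integral at `R = 8` is harmless here because $r e^{-r^2}$ is vanishingly small beyond that point.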


A.5.7 Summary 


• The theory of integration allows us to answer questions about areas or volumes.

• The fundamental theorem of calculus allows us to leverage knowledge about derivatives
to compute areas via the observation that the derivative of the area up to some point
is given by the value of the function being integrated.

• Integrals in higher dimensions can be computed by iterating single variable integrals.




A.5.8 Exercises 
1. What is $\int_1^2 \frac{1}{x} \, dx$?
2. Use the change of variables formula to integrate $\int_0^{\sqrt{\pi}} x \sin(x^2) \, dx$.
3. What is $\int_{[0,1]^2} xy \, dx \, dy$?
4. Use the change of variables formula to compute $\int_0^2 \int_0^1 xy(x^2 - y^2)/(x^2 + y^2)^3 \, dy \, dx$
and $\int_0^1 \int_0^2 xy(x^2 - y^2)/(x^2 + y^2)^3 \, dx \, dy$ to see they are different.




A.6 Random Variables 


In Section 2.6 we saw the basics of how to work with discrete random variables, which in 
our case refer to those random variables which take either a finite set of possible values, or 
the integers. In this section, we develop the theory of continuous random variables, which 
are random variables which can take on any real value. 


A.6.1 Continuous Random Variables 


Continuous random variables are a significantly more subtle topic than discrete random 
variables. A fair analogy to make is that the technical jump is comparable to the jump 
between adding lists of numbers and integrating functions. As such, we will need to take 
some time to develop the theory. 


From Discrete to Continuous 


To understand the additional technical challenges encountered when working with contin- 
uous random variables, let’s perform a thought experiment. Suppose that we are throwing 
a dart at the dart board, and we want to know the probability that it hits exactly 2cm from 
the center of the board. 


To start with, we imagine measuring a single digit of accuracy, that is to say with bins for
0cm, 1cm, 2cm, and so on. We throw say 100 darts at the dart board, and if 20 of them fall
into the bin for 2cm we conclude that 20% of the darts we throw hit the board 2cm away
from the center.


However, when we look closer, this does not match our question! We wanted exact equality, 
whereas these bins hold all that fell between say 1.5cm and 2.5cm. 


Undeterred, we continue further. We measure even more precisely, say 1.9cm, 2.0cm,
2.1cm, and now see that perhaps 3 of the 100 darts hit the board in the 2.0cm bucket. Thus
we conclude the probability is 3%.

However, this does not solve anything! We have just pushed the issue down one digit further.
Let’s abstract a bit. Imagine we know the probability that the first $k$ digits match with
2.00000... and we want to know the probability it matches for the first $k+1$ digits. It is
fairly reasonable to assume that the $(k+1)$th digit is essentially a random choice from the
set $\{0, 1, 2, \ldots, 9\}$. At least, we cannot conceive of a physically meaningful process which
would force the number of micrometers away from the center to prefer to end in a 7 vs. a 3.


What this means is that in essence each additional digit of accuracy we require should
decrease the probability of matching by a factor of 10. Or put another way, we would expect
that

$$P(\textrm{distance is } 2.00\ldots, \textrm{ to } k \textrm{ digits}) \approx p \cdot 10^{-k}. \tag{A.1}$$

The value $p$ essentially encodes what happens with the first few digits, and the $10^{-k}$ handles
the rest.


Notice that if we know the position accurate to $k = 4$ digits after the decimal, that means
we know the value falls within the interval say $[1.99995, 2.00005]$, which is an interval of
length $2.00005 - 1.99995 = 10^{-4}$. Thus, if we call the length of this interval $\epsilon$, we can
say

$$P(\textrm{distance is in an } \epsilon\textrm{-sized interval around } 2) \approx \epsilon \cdot p. \tag{A.2}$$


Let’s take this one final step further. We have been thinking about the point 2 the entire 
time, but never thinking about other points. Nothing is different there fundamentally, but 
it is the case that the value p will likely be different. We would at least hope that a dart 
thrower was more likely to hit a point near the center, like 2cm rather than 20cm. Thus, the 
value p is not fixed, but rather should depend on the point x. This tells us that we should 
expect 


$$P(\textrm{distance is in an } \epsilon\textrm{-sized interval around } x) \approx \epsilon \cdot p(x). \tag{A.3}$$


Indeed, (A.3) precisely defines the probability density function. It is a function p(x) which 
encodes the relative probability of hitting near one point vs. another. Let’s visualize what 
such a function might look like. 


%matplotlib inline
import torch
from IPython import display
from d2l import torch as d2l

torch.pi = torch.acos(torch.zeros(1)).item() * 2  # Define pi in torch

# Plot the probability density function for some random variable
x = torch.arange(-5, 5, 0.01)
p = 0.2*torch.exp(-(x - 3)**2 / 2)/torch.sqrt(2 * torch.tensor(torch.pi)) + \
    0.8*torch.exp(-(x + 1)**2 / 2)/torch.sqrt(2 * torch.tensor(torch.pi))

d2l.plot(x, p, 'x', 'Density')


The locations where the function value is large indicate regions where we are more likely
to find the random value. The low portions are areas where we are unlikely to find the
random value.
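To make the relation "probability of an $\epsilon$-sized interval $\approx \epsilon \cdot p(x)$" concrete, the following sketch compares the exact probability of a small interval with $\epsilon \cdot p(x)$ for the mixture density plotted above (weights 0.2 and 0.8 on unit-variance Gaussians centered at 3 and -1). The exact interval probability uses the Gaussian c.d.f. expressed through `math.erf`. The helper names and the chosen point `x0` are illustrative assumptions.

```python
import math

def normal_pdf(x, mu):
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

def normal_cdf(x, mu):
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2)))

def p(x):
    """Mixture density: 0.2 * N(3, 1) + 0.8 * N(-1, 1)."""
    return 0.2 * normal_pdf(x, 3) + 0.8 * normal_pdf(x, -1)

def interval_prob(x, eps):
    """Exact probability of the interval [x - eps/2, x + eps/2]."""
    lo, hi = x - eps / 2, x + eps / 2
    return (0.2 * (normal_cdf(hi, 3) - normal_cdf(lo, 3))
            + 0.8 * (normal_cdf(hi, -1) - normal_cdf(lo, -1)))

x0 = -1.0
ratios = [interval_prob(x0, eps) / (eps * p(x0)) for eps in (1.0, 0.1, 0.01)]
print(ratios)  # the ratios approach 1 as eps shrinks
```

The ratio tends to one as $\epsilon$ shrinks, which is exactly the sense in which $p(x)$ is a density rather than a probability.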


Probability Density Functions 


Let’s now investigate this further. We have already seen what a probability density function 
is intuitively for a random variable X, namely the density function is a function p(x) so 
that 


$$P(X \textrm{ is in an } \epsilon\textrm{-sized interval around } x) \approx \epsilon \cdot p(x). \tag{A.4}$$

But what does this imply for the properties of $p(x)$?

First, probabilities are never negative, thus we should expect that $p(x) \ge 0$ as well.


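Second, let’s imagine that we slice up $\mathbb{R}$ into an infinite number of slices which are $\epsilon$
wide, say with slices $(\epsilon \cdot i, \epsilon \cdot (i+1)]$. For each of these, we know from (A.4) the probability
is approximately

$$P(X \textrm{ is in an } \epsilon\textrm{-sized interval around } x) \approx \epsilon \cdot p(\epsilon \cdot i), \tag{A.5}$$

so summed over all of them it should be

$$P(X \in \mathbb{R}) \approx \sum_i \epsilon \cdot p(\epsilon \cdot i). \tag{A.6}$$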


This is nothing more than the approximation of an integral discussed in Section A.5, thus
we can say that

$$P(X \in \mathbb{R}) = \int_{-\infty}^{\infty} p(x) \, dx. \tag{A.7}$$

Since $P(X \in \mathbb{R}) = 1$, because the random variable must take on some number, we
can conclude that for any density

$$\int_{-\infty}^{\infty} p(x) \, dx = 1. \tag{A.8}$$


Indeed, digging into this further shows that for any $a$ and $b$, we see that

$$P(X \in (a, b]) = \int_a^b p(x) \, dx. \tag{A.9}$$

We may approximate this in code by using the same discrete approximation methods as
before. In this case we can approximate the probability of falling in the blue region.


# Approximate probability using numerical integration
epsilon = 0.01
x = torch.arange(-5, 5, 0.01)
p = 0.2*torch.exp(-(x - 3)**2 / 2) / torch.sqrt(2 * torch.tensor(torch.pi)) + \
    0.8*torch.exp(-(x + 1)**2 / 2) / torch.sqrt(2 * torch.tensor(torch.pi))

d2l.set_figsize()
d2l.plt.plot(x, p, color='black')
d2l.plt.fill_between(x.tolist()[300:800], p.tolist()[300:800])
d2l.plt.show()

f'approximate Probability: {torch.sum(epsilon*p[300:800])}'


'approximate Probability: 0.773617148399353'


It turns out that these two properties describe exactly the space of possible probability
density functions (or p.d.f.’s for the commonly encountered abbreviation). They are
non-negative functions $p(x) \ge 0$ such that

$$\int_{-\infty}^{\infty} p(x) \, dx = 1. \tag{A.10}$$


We interpret this function by using integration to obtain the probability our random variable 
is in a specific interval: 


$$P(X \in (a, b]) = \int_a^b p(x) \, dx. \tag{A.11}$$


In Section A.8 we will see a number of common distributions, but let’s continue working 
in the abstract. 


Cumulative Distribution Functions 


In the previous section, we saw the notion of the p.d.f. In practice, this is a commonly
encountered method to discuss continuous random variables, but it has one significant pitfall:
the values of the p.d.f. are not themselves probabilities, but rather values of a function that we
must integrate to yield probabilities. There is nothing wrong with a density being larger
than 10, as long as it is not larger than 10 for more than an interval of length 1/10. This
can be counter-intuitive, so people often also think in terms of the cumulative distribution
function, or c.d.f., which is a probability.


In particular, by using (A.11), we define the c.d.f. for a random variable $X$ with density
$p(x)$ by

$$F(x) = \int_{-\infty}^{x} p(x) \, dx = P(X \le x). \tag{A.12}$$

Let’s observe a few properties.
• $F(x) \rightarrow 0$ as $x \rightarrow -\infty$.
• $F(x) \rightarrow 1$ as $x \rightarrow \infty$.
• $F(x)$ is non-decreasing ($y > x \implies F(y) \ge F(x)$).
• $F(x)$ is continuous (has no jumps) if $X$ is a continuous random variable.


With the fourth bullet point, note that this would not be true if X were discrete, say taking 
the values 0 and 1 both with probability 1/2. In that case 


$$F(x) = \begin{cases} 0 & x < 0, \\ \frac{1}{2} & x < 1, \\ 1 & x \ge 1. \end{cases} \tag{A.13}$$


In this example, we see one of the benefits of working with the c.d.f., the ability to deal 
with continuous or discrete random variables in the same framework, or indeed mixtures 
of the two (flip a coin: if heads return the roll of a die, if tails return the distance of a dart 
throw from the center of a dart board). 
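The properties of the c.d.f. can be observed numerically. The sketch below builds an approximate c.d.f. for an example mixture density by accumulating $\epsilon \cdot p(x)$ from left to right, and also writes out the step-function c.d.f. of the fair coin from (A.13). The grid range $[-8, 10]$, the step `eps` and the names `F` and `F_coin` are assumptions made for this illustration.

```python
import math

def p(x):
    """Example density: 0.2 * N(3, 1) + 0.8 * N(-1, 1)."""
    g = lambda z: math.exp(-z ** 2 / 2) / math.sqrt(2 * math.pi)
    return 0.2 * g(x - 3) + 0.8 * g(x + 1)

# Continuous case: accumulate eps * p(x) over midpoints covering [-8, 10]
eps = 0.01
xs = [-8 + eps * (i + 0.5) for i in range(1800)]
F, total = [], 0.0
for x in xs:
    total += eps * p(x)
    F.append(total)

def F_coin(x):
    """c.d.f. of a fair coin taking the values 0 and 1, as in (A.13)."""
    return 0.0 if x < 0 else (0.5 if x < 1 else 1.0)

print(F[0], F[-1])                                 # near 0 on the left, near 1 on the right
print(F_coin(-0.1), F_coin(0.5), F_coin(1.0))      # jumps at 0 and at 1
```

The accumulated values never decrease because each increment $\epsilon \cdot p(x)$ is non-negative, while the coin’s c.d.f. jumps at exactly the values the coin can take.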


Means 


Suppose that we are dealing with a random variable $X$. The distribution itself can be hard
to interpret. It is often useful to be able to summarize the behavior of a random variable
concisely. Numbers that help us capture the behavior of a random variable are called
summary statistics. The most commonly encountered ones are the mean, the variance, and the
standard deviation.

The mean encodes the average value of a random variable. If we have a discrete random
variable $X$, which takes the values $x_i$ with probabilities $p_i$, then the mean is given by the
weighted average: sum the values times the probability that the random variable takes on
that value:

$$\mu_X = E[X] = \sum_i x_i p_i. \tag{A.14}$$


The way we should interpret the mean (albeit with caution) is that it tells us essentially 
where the random variable tends to be located. 


As a minimalistic example that we will examine throughout this section, let’s take $X$ to be
the random variable which takes the value $a-2$ with probability $p$, $a+2$ with probability $p$
and $a$ with probability $1-2p$. We can compute using (A.14) that, for any possible choice
of $a$ and $p$, the mean is

$$\mu_X = E[X] = \sum_i x_i p_i = (a-2)p + a(1-2p) + (a+2)p = a. \tag{A.15}$$


Thus we see that the mean is a. This matches the intuition since a is the location around 
which we centered our random variable. 


Because they are helpful, let’s summarize a few properties. 
• For any random variable $X$ and numbers $a$ and $b$, we have that $\mu_{aX+b} = a\mu_X + b$.
• If we have two random variables $X$ and $Y$, we have $\mu_{X+Y} = \mu_X + \mu_Y$.


Means are useful for understanding the average behavior of a random variable, however the
mean is not sufficient to even have a full intuitive understanding. Making a profit of $10 ± $1
per sale is very different from making $10 ± $15 per sale despite having the same average
value. The second one has a much larger degree of fluctuation, and thus represents a much
larger risk. Thus, to understand the behavior of a random variable, we will need at minimum
one more measure: some measure of how widely a random variable fluctuates.


Variances 


This leads us to consider the variance of a random variable. This is a quantitative measure
of how far a random variable deviates from the mean. Consider the expression $X - \mu_X$.
This is the deviation of the random variable from its mean. This value can be positive
or negative, so we need to do something to make it positive so that we are measuring the
magnitude of the deviation.

A reasonable thing to try is to look at $|X - \mu_X|$, and indeed this leads to a useful quantity
called the mean absolute deviation, however due to connections with other areas of
mathematics and statistics, people often use a different solution.

In particular, they look at $(X - \mu_X)^2$. If we look at the typical size of this quantity by taking
the mean, we arrive at the variance

$$\sigma_X^2 = \mathrm{Var}(X) = E\left[(X - \mu_X)^2\right] = E[X^2] - \mu_X^2. \tag{A.16}$$


The last equality in (A.16) holds by expanding out the definition in the middle, and applying 
the properties of expectation. 


Let’s look at our example where $X$ is the random variable which takes the value $a-2$ with
probability $p$, $a+2$ with probability $p$ and $a$ with probability $1-2p$. In this case $\mu_X = a$,
so all we need to compute is $E[X^2]$. This can readily be done:

$$E[X^2] = (a-2)^2 p + a^2(1-2p) + (a+2)^2 p = a^2 + 8p. \tag{A.17}$$

Thus, we see that by (A.16) our variance is

$$\sigma_X^2 = \mathrm{Var}(X) = E[X^2] - \mu_X^2 = a^2 + 8p - a^2 = 8p. \tag{A.18}$$

This result again makes sense. The largest $p$ can be is $1/2$, which corresponds to picking
$a-2$ or $a+2$ with a coin flip. The variance of this being 4 corresponds to the fact that
both $a-2$ and $a+2$ are 2 units away from the mean, and $2^2 = 4$. On the other end of the
spectrum, if $p = 0$, this random variable always takes the value $a$ and so it has no variance
at all.
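The formulas (A.14) and (A.16) translate directly into a few lines of code. The sketch below evaluates them for the three-point random variable of the running example. The particular values `a = 3.0` and `p = 0.2`, and the helper name `mean_and_variance`, are arbitrary choices for illustration.

```python
def mean_and_variance(values, probs):
    """Mean and variance of a discrete random variable via (A.14) and (A.16)."""
    mean = sum(x * q for x, q in zip(values, probs))
    second_moment = sum(x ** 2 * q for x, q in zip(values, probs))
    return mean, second_moment - mean ** 2

a, p = 3.0, 0.2
values = [a - 2, a, a + 2]
probs = [p, 1 - 2 * p, p]
mu, var = mean_and_variance(values, probs)
print(mu, var)  # close to a and to 8 * p, up to floating-point rounding
```

Changing `a` shifts the mean but leaves the variance untouched, in line with the result $\mathrm{Var}(X) = 8p$.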


We will list a few properties of variance below:
• For any random variable $X$, $\mathrm{Var}(X) \ge 0$, with $\mathrm{Var}(X) = 0$ if and only if $X$ is a constant.
• For any random variable $X$ and numbers $a$ and $b$, we have that $\mathrm{Var}(aX+b) = a^2 \mathrm{Var}(X)$.
• If we have two independent random variables $X$ and $Y$, we have $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.


When interpreting these values, there can be a bit of a hiccup. In particular, let’s try
imagining what happens if we keep track of units through this computation. Suppose that we are
working with the star rating assigned to a product on the web page. Then $a$, $a-2$, and $a+2$
are all measured in units of stars. Similarly, the mean $\mu_X$ is then also measured in stars
(being a weighted average). However, if we get to the variance, we immediately encounter
an issue, which is we want to look at $(X - \mu_X)^2$, which is in units of squared stars. This
means that the variance itself is not comparable to the original measurements. To make it
interpretable, we will need to return to our original units.


Standard Deviations 


This summary statistic can always be deduced from the variance by taking the square root!
Thus we define the standard deviation to be

$$\sigma_X = \sqrt{\mathrm{Var}(X)}. \tag{A.19}$$

In our example, this means the standard deviation is $\sigma_X = 2\sqrt{2p}$. If we are
dealing with units of stars for our review example, $\sigma_X$ is again in units of stars.


The properties we had for the variance can be restated for the standard deviation.
• For any random variable $X$, $\sigma_X \ge 0$.
• For any random variable $X$ and numbers $a$ and $b$, we have that $\sigma_{aX+b} = |a|\sigma_X$.
• If we have two independent random variables $X$ and $Y$, we have $\sigma_{X+Y} = \sqrt{\sigma_X^2 + \sigma_Y^2}$.


It is natural at this moment to ask, “If the standard deviation is in the units of our original
random variable, does it represent something we can draw with regards to that random
variable?” The answer is a resounding yes! Indeed much like the mean told us the typical
location of our random variable, the standard deviation gives the typical range of variation
of that random variable. We can make this rigorous with what is known as Chebyshev’s
inequality:

$$P\left(X \not\in [\mu_X - \alpha\sigma_X, \mu_X + \alpha\sigma_X]\right) \le \frac{1}{\alpha^2}. \tag{A.20}$$

Or to state it verbally in the case of $\alpha = 10$, 99% of the samples from any random variable
fall within 10 standard deviations of the mean. This gives an immediate interpretation to
our standard summary statistics.


To see how this statement is rather subtle, let’s take a look at our running example again
where $X$ is the random variable which takes the value $a-2$ with probability $p$, $a+2$ with
probability $p$ and $a$ with probability $1-2p$. We saw that the mean was $a$ and the standard
deviation was $2\sqrt{2p}$. This means, if we take Chebyshev’s inequality (A.20) with $\alpha = 2$,
we see that the expression is

$$P\left(X \not\in [a - 4\sqrt{2p}, a + 4\sqrt{2p}]\right) \le \frac{1}{4}. \tag{A.21}$$


This means that 75% of the time, this random variable will fall within this interval for any
value of $p$. Now, notice that as $p \rightarrow 0$, this interval also converges to the single point $a$.
But we know that our random variable takes the values $a-2$, $a$, and $a+2$ only, so eventually
we can be certain $a-2$ and $a+2$ will fall outside the interval! The question is, at what $p$
does that happen? So we want to solve: for what $p$ does $a + 4\sqrt{2p} = a + 2$, which is solved
when $p = 1/8$, which is exactly the first $p$ where it could possibly happen without violating
our claim that no more than 1/4 of samples from the distribution would fall outside the
interval (1/8 to the left, and 1/8 to the right).
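The threshold at $p = 1/8$ can be checked directly by adding up the probability mass that falls outside the Chebyshev interval. In the sketch below, `mass_outside` is a hypothetical helper written for this check, and the values of `p` tried are illustrative.

```python
import math

def mass_outside(a, p, alpha=2.0):
    """Probability that the three-point variable lies outside
    [a - alpha * sigma, a + alpha * sigma], where sigma = 2 * sqrt(2 * p)."""
    sigma = 2 * math.sqrt(2 * p)
    lo, hi = a - alpha * sigma, a + alpha * sigma
    points = [(a - 2, p), (a, 1 - 2 * p), (a + 2, p)]
    return sum(q for x, q in points if x < lo or x > hi)

for p in (0.2, 0.125, 0.05):
    print(p, mass_outside(0.0, p))
```

For $p \ge 1/8$ nothing lies outside the interval, while for $p < 1/8$ the outside mass is $2p$, which stays below the bound of $1/4$.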


Let’s visualize this. We will show the probability of getting the three values as three vertical 
bars with height proportional to the probability. The interval will be drawn as a horizontal 
line in the middle. The first plot shows what happens for p > 1/8 where the interval safely 
contains all points. 


# Define a helper to plot these figures
def plot_chebyshev(a, p):
    d2l.set_figsize()
    d2l.plt.stem([a-2, a, a+2], [p, 1-2*p, p])
    d2l.plt.xlim([-4, 4])
    d2l.plt.xlabel('x')
    d2l.plt.ylabel('p.m.f.')

    d2l.plt.hlines(0.5, a - 4 * torch.sqrt(2 * p),
                   a + 4 * torch.sqrt(2 * p), 'black', lw=4)
    d2l.plt.vlines(a - 4 * torch.sqrt(2 * p), 0.53, 0.47, 'black', lw=1)
    d2l.plt.vlines(a + 4 * torch.sqrt(2 * p), 0.53, 0.47, 'black', lw=1)
    d2l.plt.title(f'p = {p:.3f}')

    d2l.plt.show()

# Plot interval when p > 1/8
plot_chebyshev(0.0, torch.tensor(0.2))


The second shows that at $p = 1/8$, the interval exactly touches the two points. This shows
that the inequality is sharp, since no smaller interval could be taken while keeping the
inequality true.


# Plot interval when p = 1/8
plot_chebyshev(0.0, torch.tensor(0.125))


The third shows that for $p < 1/8$ the interval only contains the center. This does not
invalidate the inequality since we only needed to ensure that no more than 1/4 of the probability
falls outside the interval, which means that once $p < 1/8$, the two points at $a-2$ and $a+2$
can be discarded.


# Plot interval when p < 1/8
plot_chebyshev(0.0, torch.tensor(0.05))


Means and Variances in the Continuum


This has all been in terms of discrete random variables, but the case of continuous random
variables is similar. To intuitively understand how this works, imagine that we split the
real number line into intervals of length $\epsilon$ given by $(\epsilon i, \epsilon (i+1)]$. Once we do this, our
continuous random variable has been made discrete and we can use (A.14) to say that

$$\begin{aligned} \mu_X & \approx \sum_{i} (\epsilon i) P(X \in (\epsilon i, \epsilon (i+1)]) \\ & \approx \sum_{i} (\epsilon i) p_X(\epsilon i) \epsilon, \end{aligned} \tag{A.22}$$

where $p_X$ is the density of $X$. This is an approximation to the integral of $x p_X(x)$, so we
can conclude that

$$\mu_X = \int_{-\infty}^{\infty} x p_X(x) \, dx. \tag{A.23}$$

Similarly, using (A.16) the variance can be written as

$$\sigma_X^2 = E[X^2] - \mu_X^2 = \int_{-\infty}^{\infty} x^2 p_X(x) \, dx - \left(\int_{-\infty}^{\infty} x p_X(x) \, dx\right)^2. \tag{A.24}$$


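Everything stated above about the mean, the variance, and the standard deviation still
applies in this case. For instance, if we consider the random variable with density

$$p(x) = \begin{cases} 1 & x \in [0, 1], \\ 0 & \textrm{otherwise}, \end{cases} \tag{A.25}$$

we can compute

$$\mu_X = \int_{-\infty}^{\infty} x p(x) \, dx = \int_0^1 x \, dx = \frac{1}{2} \tag{A.26}$$

and

$$\sigma_X^2 = \int_{-\infty}^{\infty} x^2 p(x) \, dx - \left(\frac{1}{2}\right)^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}. \tag{A.27}$$
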
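The two integrals for the uniform density can be approximated with the same slicing idea used in (A.22). The sketch below is a plain-Python midpoint sum; the step `eps` and the range $[-1, 2]$ are illustrative assumptions.

```python
def density(x):
    """Uniform density on [0, 1] from (A.25)."""
    return 1.0 if 0 <= x <= 1 else 0.0

eps = 1e-3
xs = [-1 + eps * (i + 0.5) for i in range(3000)]  # midpoints covering [-1, 2]
mean = sum(x * density(x) * eps for x in xs)
second_moment = sum(x ** 2 * density(x) * eps for x in xs)
variance = second_moment - mean ** 2
print(mean, variance)  # close to 1/2 and 1/12
```

Shrinking `eps` brings the sums closer to the exact values, just as the discretization argument suggests.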
As a warning, let’s examine one more example, known as the Cauchy distribution. This is
the distribution with p.d.f. given by

$$p(x) = \frac{1}{1+x^2}. \tag{A.28}$$

# Plot the Cauchy distribution p.d.f.
x = torch.arange(-5, 5, 0.01)
p = 1 / (1 + x**2)

d2l.plot(x, p, 'x', 'p.d.f.')


This function looks innocent, and indeed consulting a table of integrals will show that the
area under it is finite (it equals $\pi$, so dividing $p(x)$ by $\pi$ gives area one), and thus,
after this normalization, it defines a continuous random variable.

To see what goes astray, let’s try to compute the variance of this. This would involve using
(A.16) computing

$$\int_{-\infty}^{\infty} \frac{x^2}{1+x^2} \, dx. \tag{A.29}$$


The function on the inside looks like this: 


# Plot the integrand needed to compute the variance
x = torch.arange(-20, 20, 0.01)
p = x**2 / (1 + x**2)

d2l.plot(x, p, 'x', 'integrand')


This function clearly has infinite area under it since it is essentially the constant one with a
small dip near zero, and indeed we could show that

$$\int_{-\infty}^{\infty} \frac{x^2}{1+x^2} \, dx = \infty. \tag{A.30}$$


This means it does not have a well-defined finite variance. 


However, looking deeper shows an even more disturbing result. Let’s try to compute the
mean using (A.14). Using the change of variables formula, we see

$$\int_{-\infty}^{\infty} \frac{x}{1+x^2} \, dx = \frac{1}{2} \int_1^{\infty} \frac{1}{u} \, du. \tag{A.31}$$


The integral inside is the definition of the logarithm, so this is in essence $\log(\infty) = \infty$, so
there is no well-defined average value either!

Machine learning scientists define their models so that we most often do not need to deal
with these issues, and will in the vast majority of cases deal with random variables with
well-defined means and variances. However, every so often random variables with heavy
tails (that is those random variables where the probabilities of getting large values are large
enough to make things like the mean or variance undefined) are helpful in modeling physical
systems, thus it is worth knowing that they exist.
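One way to see the divergence is to truncate the integrals at a cutoff $L$ and watch what happens as $L$ grows. The sketch below does this with a midpoint sum for the integrand of (A.29) and for $x/(1+x^2)$ on the positive half-line. The helper `truncated_integral`, the cutoffs and the number of slices are assumptions made for illustration.

```python
import math

def truncated_integral(g, lo, hi, n=100000):
    """Midpoint Riemann sum of g over [lo, hi]."""
    h = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * h) for i in range(n)) * h

second_moment = lambda x: x ** 2 / (1 + x ** 2)   # integrand of (A.29)
half_mean = lambda x: x / (1 + x ** 2)            # integrand for the mean, on x >= 0

results = {}
for L in (10, 100, 1000):
    results[L] = (truncated_integral(second_moment, -L, L),   # grows roughly like 2 * L
                  truncated_integral(half_mean, 0, L))        # grows roughly like log(L)
    print(L, results[L])
```

Neither column settles down as the cutoff increases: the first grows linearly and the second logarithmically, so neither integral has a finite limit.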


Joint Density Functions 


The above work all assumes we are working with a single real valued random variable. But
what if we are dealing with two or more potentially highly correlated random variables?
This circumstance is the norm in machine learning: imagine random variables like $R_{i, j}$
which encode the red value of the pixel at the $(i, j)$ coordinate in an image, or $P_t$ which is
a random variable given by a stock price at time $t$. Nearby pixels tend to have similar color,
and nearby times tend to have similar prices. We cannot treat them as separate random
variables, and expect to create a successful model (we will see in Section A.9 a model that
under-performs due to such an assumption). We need to develop the mathematical language
to handle these correlated continuous random variables.


Thankfully, with the multiple integrals in Section A.5 we can develop such a language. 
Suppose that we have, for simplicity, two random variables X, Y which can be correlated. 
Then, similar to the case of a single variable, we can ask the question: 


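$$P(X \textrm{ is in an } \epsilon\textrm{-sized interval around } x \textrm{ and } Y \textrm{ is in an } \epsilon\textrm{-sized interval around } y). \tag{A.32}$$

Similar reasoning to the single variable case shows that this should be approximately

$$P(X \textrm{ is in an } \epsilon\textrm{-sized interval around } x \textrm{ and } Y \textrm{ is in an } \epsilon\textrm{-sized interval around } y) \approx \epsilon^2 p(x, y), \tag{A.33}$$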


for some function $p(x, y)$. This is referred to as the joint density of $X$ and $Y$. Similar
properties are true for this as we saw in the single variable case. Namely:

• $p(x, y) \ge 0$;
• $\int_{\mathbb{R}^2} p(x, y) \, dx \, dy = 1$;
• $P((X, Y) \in \mathcal{D}) = \int_{\mathcal{D}} p(x, y) \, dx \, dy$.


In this way, we can deal with multiple, potentially correlated random variables. If we wish
to work with more than two random variables, we can extend the multivariate density to as
many coordinates as desired by considering $p(\mathbf{x}) = p(x_1, \ldots, x_n)$. The same properties of
being non-negative, and having total integral of one still hold.


Marginal Distributions 


When dealing with multiple variables, we oftentimes want to be able to ignore the
relationships and ask, “how is this one variable distributed?” Such a distribution is called a
marginal distribution.

To be concrete, let’s suppose that we have two random variables $X, Y$ with joint density
given by $p_{X, Y}(x, y)$. We will be using the subscript to indicate what random variables the
density is for. The question of finding the marginal distribution is taking this function, and
using it to find $p_X(x)$.


As with most things, it is best to return to the intuitive picture to figure out what should be
true. Recall that the density is the function $p_X$ so that

$$P(X \in [x, x+\epsilon]) \approx \epsilon \cdot p_X(x). \tag{A.34}$$


There is no mention of $Y$, but if all we are given is $p_{X, Y}$, we need to include $Y$ somehow.
We can first observe that this is the same as

$$P(X \in [x, x+\epsilon] \textrm{, and } Y \in \mathbb{R}) \approx \epsilon \cdot p_X(x). \tag{A.35}$$


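Our density does not directly tell us about what happens in this case. We need to split into
small intervals in $y$ as well, so we can write this as

$$\begin{aligned} \epsilon \cdot p_X(x) & \approx \sum_{i} P(X \in [x, x+\epsilon] \textrm{, and } Y \in [\epsilon \cdot i, \epsilon \cdot (i+1)]) \\ & \approx \sum_{i} \epsilon^2 p_{X, Y}(x, \epsilon \cdot i). \end{aligned} \tag{A.36}$$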


Fig. A.1 (panels: Joint Probability, Marginal Probability): By summing along the columns of
our array of probabilities, we are able to obtain the marginal distribution for just the random
variable represented along the x-axis.


This tells us to add up the value of the density along a series of squares in a line as is shown
in Fig. A.1. Indeed, after canceling one factor of epsilon from both sides, and recognizing
the sum on the right is the integral over $y$, we can conclude that

$$\begin{aligned} p_X(x) & \approx \sum_{i} \epsilon p_{X, Y}(x, \epsilon \cdot i) \\ & \approx \int_{-\infty}^{\infty} p_{X, Y}(x, y) \, dy. \end{aligned} \tag{A.37}$$

Thus we see

$$p_X(x) = \int_{-\infty}^{\infty} p_{X, Y}(x, y) \, dy. \tag{A.38}$$


This tells us that to get a marginal distribution, we integrate over the variables we do not
care about. This process is often referred to as integrating out or marginalizing out the
unneeded variables.
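As a small illustration of integrating out a variable, the sketch below uses a hypothetical joint density in which $X$ is standard normal and $Y$ equals $0.5X$ plus independent standard normal noise, so that $p_{X,Y}(x, y) = \varphi(x)\varphi(y - 0.5x)$ with $\varphi$ the standard normal density. Summing $\epsilon \cdot p_{X,Y}(x, y)$ over a grid in $y$, as in (A.37), should recover $p_X(x) = \varphi(x)$. The joint density, grid limits and names are assumptions chosen for this example only.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-z ** 2 / 2) / math.sqrt(2 * math.pi)

def joint(x, y):
    """Hypothetical joint density: X ~ N(0, 1), Y = 0.5 * X + independent N(0, 1) noise."""
    return phi(x) * phi(y - 0.5 * x)

def marginal_x(x, eps=0.01, y_min=-10.0, y_max=10.0):
    """Approximate p_X(x) by summing eps * p_{X,Y}(x, y) over a grid in y."""
    n = int(round((y_max - y_min) / eps))
    return sum(eps * joint(x, y_min + (i + 0.5) * eps) for i in range(n))

for x in (-1.0, 0.0, 2.0):
    print(x, marginal_x(x), phi(x))  # the last two columns nearly agree
```

Although $X$ and $Y$ are correlated in this example, the marginal of $X$ does not depend on that relationship once $y$ has been summed out.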


Covariance 


When dealing with multiple random variables, there is one additional summary statistic
which is helpful to know: the covariance. This measures the degree that two random
variables fluctuate together.

Suppose that we have two random variables $X$ and $Y$. To begin with, let’s suppose they
are discrete, taking on values $(x_i, y_j)$ with probability $p_{ij}$. In this case, the covariance is
defined as

$$\sigma_{XY} = \mathrm{Cov}(X, Y) = \sum_{i, j} (x_i - \mu_X)(y_j - \mu_Y) p_{ij} = E[XY] - E[X]E[Y]. \tag{A.39}$$

To think about this intuitively: consider the following pair of random variables. Suppose
that $X$ takes the values 1 and 3, and $Y$ takes the values $-1$ and 3. Suppose that we have the
following probabilities

$$\begin{aligned} P(X = 1 \textrm{ and } Y = -1) & = \frac{p}{2}, \\ P(X = 1 \textrm{ and } Y = 3) & = \frac{1-p}{2}, \\ P(X = 3 \textrm{ and } Y = -1) & = \frac{1-p}{2}, \\ P(X = 3 \textrm{ and } Y = 3) & = \frac{p}{2}, \end{aligned} \tag{A.40}$$

where p is a parameter in [0, 1] we get to pick. Notice that if p = 1 then they are both always 
their minimum or maximum values simultaneously, and if p = 0 they are guaranteed to take 
their flipped values simultaneously (one is large when the other is small and vice versa). 
If p = 1/2, then the four possibilities are all equally likely, and neither should be related. 


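Let’s compute the covariance. First, note $\mu_X = 2$ and $\mu_Y = 1$, so we may compute using
(A.39):

$$\begin{aligned} \mathrm{Cov}(X, Y) & = \sum_{i, j} (x_i - \mu_X)(y_j - \mu_Y) p_{ij} \\ & = (1-2)(-1-1)\frac{p}{2} + (1-2)(3-1)\frac{1-p}{2} + (3-2)(-1-1)\frac{1-p}{2} + (3-2)(3-1)\frac{p}{2} \\ & = 4p - 2. \end{aligned} \tag{A.41}$$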


When $p = 1$ (the case where they are both maximally positive or negative at the same time)
the covariance is 2. When $p = 0$ (the case where they are flipped) the covariance is $-2$.
Finally, when $p = 1/2$ (the case where they are unrelated), the covariance is 0. Thus we
see that the covariance measures how these two random variables are related.
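The computation in (A.41) can be reproduced by applying (A.39) to the table of probabilities in (A.40). In the sketch below, the dictionary layout and the function names are illustrative choices rather than anything prescribed by the text.

```python
def covariance(table):
    """Covariance from a dict {(x, y): probability}, following (A.39)."""
    mu_x = sum(x * q for (x, _), q in table.items())
    mu_y = sum(y * q for (_, y), q in table.items())
    return sum((x - mu_x) * (y - mu_y) * q for (x, y), q in table.items())

def joint_table(p):
    """Joint probabilities from (A.40)."""
    return {(1, -1): p / 2, (1, 3): (1 - p) / 2,
            (3, -1): (1 - p) / 2, (3, 3): p / 2}

for p in (0.0, 0.5, 1.0):
    print(p, covariance(joint_table(p)))  # matches 4 * p - 2
```

The same `covariance` helper works for any finite joint table, so other choices of values and probabilities can be explored in the same way.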


Note that the covariance only measures linear relationships. More
complex relationships, such as X = Y^2 where Y is randomly chosen from {−2, −1, 0, 1, 2} with
equal probability, can be missed. Indeed, a quick computation shows that these random
variables have covariance zero, despite one being a deterministic function of the other.
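
The "quick computation" for the X = Y^2 example can be spelled out exactly with rational arithmetic. This is a small sketch; the variable names are illustrative.

```python
from fractions import Fraction

ys = [-2, -1, 0, 1, 2]
prob = Fraction(1, len(ys))      # each value of Y is equally likely
xs = [y ** 2 for y in ys]        # X is a deterministic function of Y

mean_x = sum(x * prob for x in xs)
mean_y = sum(y * prob for y in ys)
cov_xy = sum((x - mean_x) * (y - mean_y) * prob for x, y in zip(xs, ys))
print(mean_x, mean_y, cov_xy)
```

The covariance vanishes because the terms for y and −y cancel, even though X is completely determined by Y.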


For continuous random variables, much the same story holds. At this point, we are pretty 
comfortable with doing the transition between discrete and continuous, so we will provide 
the continuous analogue of (A.39) without any derivation. 


$$\sigma_{XY} = \int_{\mathbb{R}^2} (x - \mu_X)(y - \mu_Y) p(x, y) \; dx \; dy.$$ (A.42)


For visualization, let’s take a look at a collection of random variables with tunable covari- 
ance. 


# Plot a few random variables with adjustable covariance
covs = [-0.9, 0.0, 1.2]
d2l.plt.figure(figsize=(12, 3))
for i in range(3):
    X = torch.randn(500)
    Y = covs[i]*X + torch.randn(500)

    d2l.plt.subplot(1, 4, i+1)
    d2l.plt.scatter(X.numpy(), Y.numpy())
    d2l.plt.xlabel('X')
    d2l.plt.ylabel('Y')
    d2l.plt.title(f'cov = {covs[i]}')
d2l.plt.show()


Let’s see some properties of covariances:

• For any random variable X, Cov(X, X) = Var(X).

• For any random variables X, Y and numbers a and b, Cov(aX + b, Y) = Cov(X, aY + b) =
  aCov(X, Y).

• If X and Y are independent then Cov(X, Y) = 0.


In addition, we can use the covariance to expand a relationship we saw before. Recall that
if X and Y are two independent random variables then


Var(X + Y) = Var(X) + Var(Y). (A.43)

With knowledge of covariances, we can expand this relationship. Indeed, some algebra can
show that in general,


Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y). (A.44) 


This allows us to generalize the variance summation rule for correlated random variables. 
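
As a sanity check of (A.44), the sketch below evaluates both sides on the discrete example from (A.40). The value p = 0.8 and the helper names `expect` and `var` are arbitrary choices made for illustration.

```python
p = 0.8  # any value in [0, 1] works; chosen arbitrarily
joint = {(1, -1): p / 2, (1, 3): (1 - p) / 2, (3, -1): (1 - p) / 2, (3, 3): p / 2}


def expect(f):
    """Expected value of f(X, Y) under the joint distribution."""
    return sum(f(x, y) * prob for (x, y), prob in joint.items())


def var(f):
    """Variance of f(X, Y) under the joint distribution."""
    m = expect(f)
    return expect(lambda x, y: (f(x, y) - m) ** 2)


var_x = var(lambda x, y: x)
var_y = var(lambda x, y: y)
cov_xy = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)
var_sum = var(lambda x, y: x + y)

print(var_sum, var_x + var_y + 2 * cov_xy)
```

The two printed numbers should match up to rounding, whereas Var(X) + Var(Y) alone would miss the 2Cov(X, Y) term whenever p ≠ 1/2.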


Correlation 


As we did in the case of means and variances, let’s now consider units. If X is measured in
one unit (say inches), and Y is measured in another (say dollars), the covariance is measured
in the product of these two units, inches × dollars. These units can be hard to interpret. What
we will often want in this case is a unit-less measurement of relatedness. Indeed, often we
do not care about exact quantitative correlation, but rather ask if the correlation is in the
same direction, and how strong the relationship is.


To see what makes sense, let’s perform a thought experiment. Suppose that we convert
our random variables in inches and dollars to be in inches and cents. In this case the random
variable Y is multiplied by 100. If we work through the definition, this means that
Cov(X, Y) will be multiplied by 100. Thus we see that in this case a change of units changes
the covariance by a factor of 100. Thus, to find our unit-invariant measure of correlation,
we will need to divide by something else that also gets scaled by 100. Indeed we have a
clear candidate, the standard deviation! Indeed if we define the correlation coefficient to
be

$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y},$$ (A.45)


we see that this is a unit-less value. A little mathematics can show that this number is
between −1 and 1, with 1 meaning maximally positively correlated, whereas −1 means
maximally negatively correlated.


Returning to our explicit discrete example above, we can see that σ_X = 1 and σ_Y = 2,
so we can compute the correlation between the two random variables using (A.45) to see
that

$$\rho(X, Y) = \frac{4p - 2}{1 \cdot 2} = 2p - 1.$$ (A.46)

This now ranges between −1 and 1 with the expected behavior of 1 meaning most positively
correlated, and −1 meaning most negatively correlated.


As another example, consider X as any random variable, and Y = aX + b as any linear
deterministic function of X with a ≠ 0. Then, one can compute that

$$\sigma_Y = \sigma_{aX + b} = |a| \sigma_X,$$ (A.47)
$$\mathrm{Cov}(X, Y) = \mathrm{Cov}(X, aX + b) = a \mathrm{Cov}(X, X) = a \mathrm{Var}(X),$$ (A.48)

and thus by (A.45) that

$$\rho(X, Y) = \frac{a \mathrm{Var}(X)}{|a| \sigma_X^2} = \frac{a}{|a|} = \mathrm{sign}(a).$$ (A.49)

Thus we see that the correlation is +1 for any a > 0, and −1 for any a < 0, illustrating that
correlation measures the degree and direction in which the two random variables are related,
not the scale that the variation takes.
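
The same behavior shows up for sample correlations. The sketch below uses a small hand-picked data set and a hypothetical helper `correlation`; both are assumptions made for illustration only.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient, following the form of (A.45)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)


xs = [0.5, 1.0, 2.5, 4.0, 7.0]
for a, b in [(3.0, 1.0), (0.01, -5.0), (-2.0, 4.0)]:
    ys = [a * x + b for x in xs]
    print(a, b, round(correlation(xs, ys), 6))
```

Whatever the magnitude of a or the offset b, the result depends only on the sign of a.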


Let’s again plot a collection of random variables with tunable correlation. 


# Plot a few random variables with adjustable correlations
cors = [-0.9, 0.0, 1.0]
d2l.plt.figure(figsize=(12, 3))
for i in range(3):
    X = torch.randn(500)
    Y = cors[i] * X + torch.sqrt(torch.tensor(1) -
                                 cors[i]**2) * torch.randn(500)

    d2l.plt.subplot(1, 4, i + 1)
    d2l.plt.scatter(X.numpy(), Y.numpy())
    d2l.plt.xlabel('X')
    d2l.plt.ylabel('Y')
    d2l.plt.title(f'cor = {cors[i]}')
d2l.plt.show()


[Figure: scatter plots of Y against X for cor = -0.9, cor = 0.0, and cor = 1.0.]


Let’s list a few properties of the correlation below.

• For any random variable X with non-zero variance, ρ(X, X) = 1.

• For any random variables X, Y and numbers a > 0 and b, ρ(aX + b, Y) = ρ(X, aY + b) =
  ρ(X, Y). (For a < 0 the sign of the correlation flips.)

• If X and Y are independent with non-zero variance then ρ(X, Y) = 0.


As a final note, you may feel like some of these formulae are familiar. Indeed, if we expand
everything out assuming that μ_X = μ_Y = 0, we see that this is

$$\rho(X, Y) = \frac{\sum_{i, j} x_i y_j p_{ij}}{\sqrt{\sum_{i, j} x_i^2 p_{ij}} \sqrt{\sum_{i, j} y_j^2 p_{ij}}}.$$ (A.50)

This looks like a sum of a product of terms divided by the square root of sums of terms.
This is exactly the formula for the cosine of the angle between two vectors v, w with the
different coordinates weighted by p_{ij}:

$$\cos(\theta) = \frac{\mathbf{v} \cdot \mathbf{w}}{\|\mathbf{v}\| \|\mathbf{w}\|} = \frac{\sum_{i} v_i w_i}{\sqrt{\sum_{i} v_i^2} \sqrt{\sum_{i} w_i^2}}.$$ (A.51)

Indeed if we think of norms as being related to standard deviations, and correlations as
being cosines of angles, much of the intuition we have from geometry can be applied to
thinking about random variables.


A.6.2 Summary 


• Continuous random variables are random variables that can take on a continuum of values.
  They have some technical difficulties that make them more challenging to work
  with compared to discrete random variables.

• The probability density function allows us to work with continuous random variables by
  giving a function where the area under the curve on some interval gives the probability
  of finding a sample point in that interval.

• The cumulative distribution function is the probability of observing the random variable
  to be less than a given threshold. It can provide a useful alternate viewpoint which
  unifies discrete and continuous variables.

• The mean is the average value of a random variable.

• The variance is the expected square of the difference between the random variable and
  its mean.

• The standard deviation is the square root of the variance. It can be thought of as
  measuring the range of values the random variable may take.

• Chebyshev’s inequality allows us to make this intuition rigorous by giving an explicit
  interval that contains the random variable most of the time.

• Joint densities allow us to work with correlated random variables. We may marginalize
  joint densities by integrating over unwanted random variables to get the distribution
  of the desired random variable.

• The covariance and correlation coefficient provide a way to measure any linear
  relationship between two correlated random variables.


A.6.3 Exercises 


1. Suppose that we have the random variable with density given by p(x) = 1/x^2 for x ≥ 1
   and p(x) = 0 otherwise. What is P(X > 2)?


2. The Laplace distribution is a random variable whose density is given by p(x) = (1/2)e^{−|x|}.
   What is the mean and the standard deviation of this function? As a hint, ∫_0^∞ x e^{−x} dx = 1
   and ∫_0^∞ x^2 e^{−x} dx = 2.


3. I walk up to you on the street and say “I have a random variable with mean 1, standard
   deviation 2, and I observed 25% of my samples taking a value larger than 9.” Do you
   believe me? Why or why not?


4. Suppose that you have two random variables X, Y, with joint density given by p_{XY}(x, y) =
   4xy for x, y ∈ [0, 1] and p_{XY}(x, y) = 0 otherwise. What is the covariance of X and Y?


Discussions


A.7 Maximum Likelihood 


One of the most commonly encountered ways of thinking in machine learning is the maximum
likelihood point of view. This is the concept that when working with a probabilistic
model with unknown parameters, the parameters which make the data have the highest
probability are the most likely ones.


A.7.1 The Maximum Likelihood Principle 


This has a Bayesian interpretation which can be helpful to think about. Suppose that we
have a model with parameters θ and a collection of data examples X. For concreteness, we
can imagine that θ is a single value representing the probability that a coin comes up heads
when flipped, and X is a sequence of independent coin flips. We will look at this example
in depth later.


If we want to find the most likely value for the parameters of our model, that means we 
want to find 


$$\mathop{\mathrm{argmax}}_{\theta} P(\theta \mid X).$$ (A.1)


By Bayes’ rule, this is the same thing as

$$\mathop{\mathrm{argmax}}_{\theta} \frac{P(X \mid \theta) P(\theta)}{P(X)}.$$ (A.2)

The expression P(X), a parameter-agnostic probability of generating the data, does not
depend on θ at all, and so can be dropped without changing the best choice of θ. Similarly,
we may now posit that we have no prior assumption on which set of parameters are better
than any others, so we may declare that P(θ) does not depend on θ either! This, for
instance, makes sense in our coin flipping example where the probability it comes up heads
could be any value in [0, 1] without any prior belief it is fair or not (often referred to as an
uninformative prior). Thus we see that our application of Bayes’ rule shows that our best
choice of θ is the maximum likelihood estimate for θ:

$$\hat{\theta} = \mathop{\mathrm{argmax}}_{\theta} P(X \mid \theta).$$ (A.3)


As a matter of common terminology, the probability of the data given the parameters
(P(X | θ)) is referred to as the likelihood.

A Concrete Example


Let’s see how this works in a concrete example. Suppose that we have a single parameter θ
representing the probability that a coin flip is heads. Then the probability of getting a tails
is 1 − θ, and so if our observed data X is a sequence with n_H heads and n_T tails, we can
use the fact that independent probabilities multiply to see that

$$P(X \mid \theta) = \theta^{n_H} (1 - \theta)^{n_T}.$$ (A.4)


If we flip 13 coins and get the sequence “HHHTHTTHHHHHT”, which has n_H = 9 and
n_T = 4, we see that this is

$$P(X \mid \theta) = \theta^9 (1 - \theta)^4.$$ (A.5)


One nice thing about this example will be that we know the answer going in. Indeed, if
we said verbally, “I flipped 13 coins, and 9 came up heads, what is our best guess for the
probability that the coin comes up heads?,” everyone would correctly guess 9/13. What this
maximum likelihood method will give us is a way to get that number from first principles
in a way that will generalize to vastly more complex situations.


For our example, the plot of P(X | θ) is as follows:


%matplotlib inline
import torch
from d2l import torch as d2l

theta = torch.arange(0, 1, 0.001)
p = theta**9 * (1 - theta)**4

d2l.plot(theta, p, 'theta', 'likelihood')


[Figure: the likelihood plotted against theta on the interval from 0.0 to 1.0.]


This has its maximum value somewhere near our expected 9/13 ≈ 0.69. To see if it
is exactly there, we can turn to calculus. Notice that at the maximum, the derivative of the
function is zero. Thus, we could find the maximum likelihood estimate (A.1) by finding
the values of θ where the derivative is zero, and finding the one that gives the highest
probability. We compute:

$$\begin{aligned}
0 &= \frac{d}{d\theta} P(X \mid \theta) \\
&= \frac{d}{d\theta} \theta^9 (1 - \theta)^4 \\
&= 9\theta^8 (1 - \theta)^4 - 4\theta^9 (1 - \theta)^3 \\
&= \theta^8 (1 - \theta)^3 (9 - 13\theta).
\end{aligned}$$ (A.6)

This has three solutions: 0, 1 and 9/13. The first two are clearly minima, not maxima, as
they assign probability 0 to our sequence. The final value does not assign zero probability
to our sequence, and thus must be the maximum likelihood estimate θ̂ = 9/13.
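
The calculus result can be cross-checked with a brute-force search. The sketch below is plain Python, independent of the torch code used elsewhere in this section; the grid resolution of 1/10000 is an arbitrary choice.

```python
def likelihood(theta, n_h=9, n_t=4):
    """Likelihood (A.5) of 9 heads and 4 tails for heads-probability theta."""
    return theta ** n_h * (1 - theta) ** n_t


# Evaluate the likelihood on an evenly spaced grid over [0, 1]
grid = [i / 10000 for i in range(10001)]
theta_hat = max(grid, key=likelihood)

print(theta_hat, 9 / 13)
```

The grid maximizer should land within one grid step of 9/13, and the endpoints 0 and 1 give likelihood exactly zero.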


A.7.2 Numerical Optimization and the Negative Log-Likelihood 


The previous example is nice, but what if we have billions of parameters and data exam- 
ples? 


First, notice that if we make the assumption that all the data examples are independent, we
can no longer practically consider the likelihood itself as it is a product of many probabilities.
Indeed, each probability is in [0, 1], say typically of value about 1/2, and the product
(1/2)^{1000000000} is far below machine precision. We cannot work with that directly.


However, recall that the logarithm turns products to sums, in which case (taking base-10
logarithms for this illustration)

$$\log_{10}\left((1/2)^{1000000000}\right) = 1000000000 \cdot \log_{10}(1/2) \approx -301029995.6\ldots$$ (A.7)


This number fits perfectly within even a single precision 32-bit float. Thus, we should 
consider the log-likelihood, which is 


log(P(X | 0)). (A.8) 


Since the function x ↦ log(x) is increasing, maximizing the likelihood is the same thing
as maximizing the log-likelihood. Indeed in Section A.9 we will see this reasoning applied
when working with the specific example of the naive Bayes classifier.
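
The underflow problem is easy to reproduce even with far fewer than a billion factors. The sketch below uses 2000 probabilities of 1/2 in double precision; the count 2000 is an arbitrary choice that is just large enough to underflow.

```python
import math

probs = [0.5] * 2000

# The direct product underflows: 2**-2000 is far below the smallest positive double
product = math.prod(probs)

# The sum of logarithms stays comfortably representable
log_sum = sum(math.log(q) for q in probs)

print(product, log_sum)
```

The product is reported as exactly zero, so it carries no information about the parameters, while the log-sum is an ordinary finite number.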


We often work with loss functions, where we wish to minimize the loss. We may turn
maximum likelihood into the minimization of a loss by taking −log(P(X | θ)), which is
the negative log-likelihood.


To illustrate this, consider the coin flipping problem from before, and pretend that we do
not know the closed form solution. We may compute that

$$-\log(P(X \mid \theta)) = -\log(\theta^{n_H} (1 - \theta)^{n_T}) = -(n_H \log(\theta) + n_T \log(1 - \theta)).$$ (A.9)

This can be written into code, and freely optimized even for billions of coin flips.


# Set up our data
n_H = 8675309
n_T = 256245

# Initialize our parameters
theta = torch.tensor(0.5, requires_grad=True)

# Perform gradient descent
lr = 1e-9
for iter in range(100):
    loss = -(n_H * torch.log(theta) + n_T * torch.log(1 - theta))
    loss.backward()
    with torch.no_grad():
        theta -= lr * theta.grad
    theta.grad.zero_()

# Check output
theta, n_H / (n_H + n_T)


(tensor(0.9713, requires_grad=True), 0.9713101437890875)

Numerical convenience is not the only reason why people like to use negative log-likelihoods. 
There are several other reasons why it is preferable. 


The second reason we consider the log-likelihood is the simplified application of calcu- 
lus rules. As discussed above, due to independence assumptions, most probabilities we 
encounter in machine learning are products of individual probabilities. 


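$$P(X \mid \theta) = p(x_1 \mid \theta) \cdot p(x_2 \mid \theta) \cdots p(x_n \mid \theta).$$ (A.10)


This means that if we directly apply the product rule to compute a derivative we get

$$\begin{aligned}
\frac{\partial}{\partial \theta} P(X \mid \theta) &= \left(\frac{\partial}{\partial \theta} P(x_1 \mid \theta)\right) \cdot P(x_2 \mid \theta) \cdots P(x_n \mid \theta) \\
&\quad + P(x_1 \mid \theta) \cdot \left(\frac{\partial}{\partial \theta} P(x_2 \mid \theta)\right) \cdots P(x_n \mid \theta) \\
&\quad \vdots \\
&\quad + P(x_1 \mid \theta) \cdot P(x_2 \mid \theta) \cdots \left(\frac{\partial}{\partial \theta} P(x_n \mid \theta)\right).
\end{aligned}$$ (A.11)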


This requires n(n − 1) multiplications, along with (n − 1) additions, so it takes time
quadratic in the number of inputs! Sufficient cleverness in grouping terms will reduce this
to linear time, but it requires some thought. For the negative log-likelihood we have instead

$$-\log(P(X \mid \theta)) = -\log(P(x_1 \mid \theta)) - \log(P(x_2 \mid \theta)) \cdots - \log(P(x_n \mid \theta)),$$ (A.12)


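which then gives

$$-\frac{\partial}{\partial \theta} \log(P(X \mid \theta)) = -\frac{1}{P(x_1 \mid \theta)} \left(\frac{\partial}{\partial \theta} P(x_1 \mid \theta)\right) - \cdots - \frac{1}{P(x_n \mid \theta)} \left(\frac{\partial}{\partial \theta} P(x_n \mid \theta)\right).$$ (A.13)


This requires only n divides and n − 1 sums, and thus is linear time in the inputs.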
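
To make the per-example structure of (A.13) concrete, here is a sketch for the coin-flip model, where P(x | θ) is θ for heads and 1 − θ for tails. The encoding of flips as 1 and 0 and the function names are assumptions made for this illustration.

```python
import math

# The 13 flips from the concrete example, encoded as 1 = heads, 0 = tails
flips = [1 if c == "H" else 0 for c in "HHHTHTTHHHHHT"]


def nll(theta, flips):
    """Negative log-likelihood: one log term per example."""
    return -sum(math.log(theta if x == 1 else 1 - theta) for x in flips)


def nll_grad(theta, flips):
    """Gradient of the NLL: one term -(1 / P(x_i)) * dP(x_i)/dtheta per example."""
    total = 0.0
    for x in flips:
        prob = theta if x == 1 else 1 - theta     # P(x_i | theta)
        dprob = 1.0 if x == 1 else -1.0           # derivative of P(x_i | theta)
        total -= dprob / prob
    return total


print(nll_grad(0.6, flips), nll_grad(9 / 13, flips))
```

The loop makes a single pass over the data, and the gradient vanishes at 9/13, in agreement with the closed-form maximum likelihood estimate.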


The third and final reason to consider the negative log-likelihood is the relationship to
information theory, which we will discuss in detail in Section A.11. This is a rigorous
mathematical theory which gives a way to measure the degree of information or randomness
in a random variable. The key object of study in that field is the entropy which is

$$H(p) = -\sum_{i} p_i \log_2(p_i),$$ (A.14)

which measures the randomness of a source. Notice that this is nothing more than the
average −log probability, and thus if we take our negative log-likelihood and divide by the
number of data examples, we get a relative of entropy known as cross-entropy. This theoretical
interpretation alone would be sufficiently compelling to motivate reporting the average
negative log-likelihood over the dataset as a way of measuring model performance.
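
The link between entropy and the average negative log-likelihood can be seen on the coin-flip data. In this sketch the helper names are illustrative, and logarithms are taken base 2 to match (A.14).

```python
import math


def entropy(probs):
    """Entropy (A.14) in bits; zero-probability outcomes contribute nothing."""
    return -sum(q * math.log2(q) for q in probs if q > 0)


def average_nll_bits(theta, flips):
    """Average negative log2-likelihood per flip under heads-probability theta."""
    return -sum(math.log2(theta if x == 1 else 1 - theta) for x in flips) / len(flips)


flips = [1 if c == "H" else 0 for c in "HHHTHTTHHHHHT"]

print(entropy([0.5, 0.5]))
print(average_nll_bits(9 / 13, flips), entropy([9 / 13, 4 / 13]))
print(average_nll_bits(0.5, flips))
```

When θ equals the empirical frequency of heads, the average negative log-likelihood coincides with the entropy of the empirical distribution; any other θ gives a larger value.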


A.7.3 Maximum Likelihood for Continuous Variables 


Everything that we have done so far assumes we are working with discrete random vari- 
ables, but what if we want to work with continuous ones? 


The short summary is that nothing at all changes, except we replace all the instances of the
probability with the probability density. Recalling that we write densities with lower case
p, this means that for example we now say

$$-\log(p(X \mid \theta)) = -\log(p(x_1 \mid \theta)) - \log(p(x_2 \mid \theta)) \cdots - \log(p(x_n \mid \theta)) = -\sum_{i} \log(p(x_i \mid \theta)).$$ (A.15)


The question becomes, “Why is this OK?” After all, the reason we introduced densities was
because the probability of getting any specific outcome is itself zero. Is the
probability of generating our data then not zero for every set of parameters?


Indeed, this is the case, and understanding why we can shift to densities is an exercise in 
tracing what happens to the epsilons. 


Let’s first re-define our goal. Suppose that for continuous random variables we no longer
want to compute the probability of getting exactly the right value, but instead matching to
within some range ε. For simplicity, we assume our data is repeated observations x_1, ..., x_N
of independent, identically distributed random variables X_1, ..., X_N. As we have seen
previously, this can be written as

$$\begin{aligned}
& P(X_1 \in [x_1, x_1 + \epsilon], X_2 \in [x_2, x_2 + \epsilon], \ldots, X_N \in [x_N, x_N + \epsilon] \mid \theta) \\
& \quad \approx \epsilon^N p(x_1 \mid \theta) \cdot p(x_2 \mid \theta) \cdots p(x_N \mid \theta).
\end{aligned}$$ (A.16)

Thus, if we take negative logarithms of this we obtain

$$\begin{aligned}
& -\log(P(X_1 \in [x_1, x_1 + \epsilon], X_2 \in [x_2, x_2 + \epsilon], \ldots, X_N \in [x_N, x_N + \epsilon] \mid \theta)) \\
& \quad \approx -N \log(\epsilon) - \sum_{i} \log(p(x_i \mid \theta)).
\end{aligned}$$ (A.17)


If we examine this expression, the only place that the ε occurs is in the additive constant
−N log(ε). This does not depend on the parameters θ at all, so the optimal choice of θ does
not depend on our choice of ε! If we demand four digits or four-hundred, the best choice
of θ remains the same, thus we may freely drop the epsilon to see that what we want to
optimize is

$$-\sum_{i} \log(p(x_i \mid \theta)).$$ (A.18)


Thus, we see that the maximum likelihood point of view can operate with continuous ran- 
dom variables as easily as with discrete ones by replacing the probabilities with probability 
densities. 
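
The claim that ε does not affect the optimum can be tested directly. The sketch below assumes a Gaussian density with unit variance and unknown mean, five made-up observations, and a coarse grid of candidate means; all of these are illustrative assumptions rather than part of the original argument.

```python
import math


def gaussian_density(x, mu):
    """Density of a unit-variance Gaussian with mean mu."""
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)


def approx_nll(mu, data, eps):
    """Right-hand side of (A.17): -N log(eps) - sum_i log p(x_i | mu)."""
    return (-len(data) * math.log(eps)
            - sum(math.log(gaussian_density(x, mu)) for x in data))


data = [0.3, 1.9, 2.2, 0.8, 1.3]                 # made-up observations
grid = [i / 100 for i in range(-300, 501)]       # candidate means in [-3, 5]

best = {eps: min(grid, key=lambda mu: approx_nll(mu, data, eps))
        for eps in (1e-1, 1e-4, 1e-8)}
print(best)
```

Changing ε shifts every objective value by the same constant, so the minimizing mean on the grid is the same for all three tolerances.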


A.7.4 Summary 


• The maximum likelihood principle tells us that the best fit model for a given dataset is
  the one that generates the data with the highest probability.

• Often people work with the negative log-likelihood instead for a variety of reasons:
  numerical stability, conversion of products to sums (and the resulting simplification of
  gradient computations), and theoretical ties to information theory.

• While simplest to motivate in the discrete setting, it may be freely generalized to the
  continuous setting as well by maximizing the probability density assigned to the
  datapoints.


A.7.5 Exercises 


1. Suppose that you know that a non-negative random variable has density αe^{−αx} for some
   value α > 0. You obtain a single observation from the random variable which is the
   number 3. What is the maximum likelihood estimate for α?


2. Suppose that you have a dataset of samples {x_i}_{i=1}^N drawn from a Gaussian with unknown
   mean, but variance 1. What is the maximum likelihood estimate for the mean?


Discussions


A.8 Distributions 


Now that we have learned how to work with probability in both the discrete and the contin- 
uous setting, let’s get to know some of the common distributions encountered. Depending 
on the area of machine learning, we may need to be familiar with vastly more of these, or 
for some areas of deep learning potentially none at all. This is, however, a good basic list 
to be familiar with. Let’s first import some common libraries. 
%matplotlib inline
from math import erf, factorial
import torch
from IPython import display
from d2l import torch as d2l

torch.pi = torch.acos(torch.zeros(1)) * 2  # Define pi in torch


A.8.1 Bernoulli 


This is the simplest random variable usually encountered. This random variable encodes a
coin flip which comes up 1 with probability p and 0 with probability 1 − p. If we have a
random variable X with this distribution, we will write


X ~ Bernoulli(p). (A.1) 


The cumulative distribution function is 


$$F(x) = \begin{cases} 0 & x < 0, \\ 1 - p & 0 \le x < 1, \\ 1 & x \ge 1. \end{cases}$$ (A.2)


The probability mass function is plotted below. 


p = 0.3

d2l.set_figsize()
d2l.plt.stem([0, 1], [1 - p, p], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()


Now, let’s plot the cumulative distribution function (A.2). 


x = torch.arange(-1, 2, 0.01)

def F(x):
    # Matches (A.2): the c.d.f. already equals 1 at x = 1
    return 0 if x < 0 else 1 if x >= 1 else 1 - p

d2l.plot(x, torch.tensor([F(y) for y in x]), 'x', 'c.d.f.')


If X ∼ Bernoulli(p), then:

• μ_X = p,

• σ_X^2 = p(1 − p).

We can sample an array of arbitrary shape from a Bernoulli random variable as follows.


1*(torch.rand(10, 10) < p)


[Output: a 10 × 10 tensor of 0s and 1s; the individual entries vary from run to run.]


A.8.2 Discrete Uniform 


The next commonly encountered random variable is a discrete uniform. For our discussion
here, we will assume that it is supported on the integers {1, 2, ..., n}; however, any other
set of values can be freely chosen. The meaning of the word uniform in this context is that
every possible value is equally likely. The probability for each value i ∈ {1, 2, 3, ..., n} is
p_i = 1/n. We will denote a random variable X with this distribution as


X ∼ U(n). (A.3)
The cumulative distribution function is

$$F(x) = \begin{cases} 0 & x < 1, \\ \frac{k}{n} & k \le x < k + 1 \text{ with } 1 \le k < n, \\ 1 & x \ge n. \end{cases}$$ (A.4)


Let’s first plot the probability mass function. 


n = 5

d2l.plt.stem([i+1 for i in range(n)], n*[1 / n], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()




Now, let’s plot the cumulative distribution function (A.4). 


x = torch.arange(-1, 6, 0.01)

def F(x):
    return 0 if x < 1 else 1 if x > n else torch.floor(x) / n

d2l.plot(x, torch.tensor([F(y) for y in x]), 'x', 'c.d.f.')


If X ∼ U(n), then:

• μ_X = (1 + n)/2,

• σ_X^2 = (n^2 − 1)/12.

We can sample an array of arbitrary shape from a discrete uniform random variable as
follows. Note that the upper bound of torch.randint is exclusive, so we pass n + 1 in order
to include the value n.


torch.randint(1, n + 1, size=(10, 10))


[Output: a 10 × 10 tensor of integers drawn from {1, ..., n}; the individual entries vary from run to run.]


A.8.3 Continuous Uniform 


Next, let’s discuss the continuous uniform distribution. The idea behind this random vari- 
able is that if we increase the n in the discrete uniform distribution, and then scale it to fit 
within the interval [a, b], we will approach a continuous random variable that just picks 
an arbitrary value in [a,b] all with equal probability. We will denote this distribution 
as 


X ~ U(a,b). (A.5) 
The probability density function is

$$p(x) = \begin{cases} \frac{1}{b - a} & x \in [a, b], \\ 0 & x \notin [a, b]. \end{cases}$$ (A.6)


The cumulative distribution function is 


$$F(x) = \begin{cases} 0 & x < a, \\ \frac{x - a}{b - a} & x \in [a, b], \\ 1 & x \ge b. \end{cases}$$ (A.7)


Let’s first plot the probability density function (A.6). 


a, b = 1, 3

x = torch.arange(0, 4, 0.01)
p = (x > a).type(torch.float32)*(x < b).type(torch.float32)/(b-a)
d2l.plot(x, p, 'x', 'p.d.f.')


Now, let’s plot the cumulative distribution function (A.7). 
def F(x):
    return 0 if x < a else 1 if x > b else (x - a) / (b - a)

d2l.plot(x, torch.tensor([F(y) for y in x]), 'x', 'c.d.f.')




If X ∼ U(a, b), then:

• μ_X = (a + b)/2,

• σ_X^2 = (b − a)^2/12.


We can sample an array of arbitrary shape from a uniform random variable as follows. Note
that torch.rand samples from U(0, 1) by default, so if we want a different range we need to
scale and shift it.


(b - a) * torch.rand(10, 10) + a 


[Output: a 10 × 10 tensor of floating-point values drawn uniformly between a = 1 and b = 3; the individual entries vary from run to run.]


A.8.4 Binomial 


Let’s make things a little more complex and examine the binomial random variable. This 
random variable originates from performing a sequence of n independent experiments, each 
of which has probability p of succeeding, and asking how many successes we expect to 
see. 


Let’s express this mathematically. Each experiment is an independent random variable X_i
where we will use 1 to encode success, and 0 to encode failure. Since each is an independent
coin flip which is successful with probability p, we can say that X_i ∼ Bernoulli(p). Then,
the binomial random variable is

$$X = \sum_{i=1}^{n} X_i.$$ (A.8)

In this case, we will write

X ∼ Binomial(n, p). (A.9)


To get the cumulative distribution function, we need to notice that getting exactly k successes
can occur in (n choose k) = n!/(k!(n − k)!) ways, each of which has a probability of
p^k (1 − p)^{n−k} of occurring. Thus the cumulative distribution function is

$$F(x) = \begin{cases} 0 & x < 0, \\ \sum_{m \le k} \binom{n}{m} p^m (1 - p)^{n - m} & k \le x < k + 1 \text{ with } 0 \le k < n, \\ 1 & x \ge n. \end{cases}$$ (A.10)


Let’s first plot the probability mass function. 
n, p = 10, 0.2

# Compute binomial coefficient
def binom(n, k):
    comb = 1
    for i in range(min(k, n - k)):
        comb = comb * (n - i) // (i + 1)
    return comb

pmf = torch.tensor([p**i * (1-p)**(n - i) * binom(n, i) for i in range(n + 1)])

d2l.plt.stem([i for i in range(n + 1)], pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()


[Figure: stem plot of the p.m.f. of Binomial(10, 0.2).]

Now, let’s plot the cumulative distribution function (A.10).


x = torch.arange(-1, 11, 0.01)
cmf = torch.cumsum(pmf, dim=0)

def F(x):
    return 0 if x < 0 else 1 if x > n else cmf[int(x)]

d2l.plot(x, torch.tensor([F(y) for y in x.tolist()]), 'x', 'c.d.f.')


[Figure: the c.d.f. of Binomial(10, 0.2).]


If X ∼ Binomial(n, p), then:

• μ_X = np,

• σ_X^2 = np(1 − p).


This follows from the linearity of expected value over the sum of n Bernoulli random vari- 
ables, and the fact that the variance of the sum of independent random variables is the sum 
of the variances. This can be sampled as follows. 


m = torch.distributions.binomial.Binomial(n, p)
m.sample(sample_shape=(10, 10))


[Output: a 10 × 10 tensor of success counts between 0 and n; the individual entries vary from run to run.]
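
The stated mean and variance can also be confirmed directly from the probability mass function, without sampling. The following is a sketch in plain Python using `math.comb` from the standard library; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb


def binomial_pmf(n, p):
    """List of P(X = k) for k = 0, ..., n when X ~ Binomial(n, p)."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]


n, p = 10, 0.2
pmf = binomial_pmf(n, p)

total = sum(pmf)
mean = sum(k * q for k, q in enumerate(pmf))
variance = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))

print(total, mean, variance)
```

Up to rounding, the three numbers should be 1, np, and np(1 − p) respectively.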


A.8.5 Poisson 


Let’s now perform a thought experiment. We are standing at a bus stop and we want to
know how many buses will arrive in the next minute. Let’s start by considering X^{(1)} ∼
Bernoulli(p), where p is simply the probability that a bus arrives in the one minute window.
For bus stops far from an urban center, this might be a pretty good approximation. We may
never see more than one bus in a minute.


However, if we are in a busy area, it is possible or even likely that two buses will arrive.
We can model this by splitting our random variable into two parts, one for the first 30 seconds
and one for the second 30 seconds. In this case we can write

$$X^{(2)} = X^{(2)}_1 + X^{(2)}_2,$$ (A.11)

where X^{(2)} is the total sum, and X^{(2)}_i ∼ Bernoulli(p/2). The total distribution is then
X^{(2)} ∼ Binomial(2, p/2).


Why stop here? Let’s continue to split that minute into n parts. By the same reasoning as
above, we see that

$$X^{(n)} \sim \mathrm{Binomial}(n, p/n).$$ (A.12)

Consider these random variables. By the previous section, we know that (A.12) has mean
μ_{X^{(n)}} = n(p/n) = p, and variance σ^2_{X^{(n)}} = n(p/n)(1 − (p/n)) = p(1 − p/n). If we take
n → ∞, we can see that these numbers stabilize to μ_{X^{(∞)}} = p, and variance σ^2_{X^{(∞)}} = p. This
indicates that there could be some random variable we can define in this infinite subdivision
limit.


This should not come as too much of a surprise, since in the real world we can just count
the number of bus arrivals; however, it is nice to see that our mathematical model is well
defined. This discussion can be made formal as the law of rare events.


Following through this reasoning carefully, we can arrive at the following model. We will
say that X ∼ Poisson(λ) if it is a random variable which takes the values {0, 1, 2, ...} with
probability

$$p_k = \frac{\lambda^k e^{-\lambda}}{k!}.$$ (A.13)

The value λ > 0 is known as the rate (or the shape parameter), and denotes the average
number of arrivals we expect in one unit of time.
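
The subdivision argument can be illustrated numerically by comparing the Binomial(n, λ/n) probabilities with the Poisson probabilities in (A.13) as n grows. This is a sketch under illustrative choices: λ = 5, counts k up to 20, and the helper name `max_gap`.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p); zero when k exceeds n."""
    if k > n:
        return 0.0
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), as in (A.13)."""
    return lam ** k * exp(-lam) / factorial(k)


lam = 5.0


def max_gap(n, k_max=20):
    """Largest pointwise difference between the two p.m.f.s for k = 0, ..., k_max."""
    return max(abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam))
               for k in range(k_max + 1))


gaps = [max_gap(n) for n in (10, 100, 1000, 10000)]
print(gaps)
```

The gap shrinks steadily as the minute is split into more pieces, which is the behavior the law of rare events describes.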


We may sum this probability mass function to get the cumulative distribution function.

$$F(x) = \begin{cases} 0 & x < 0, \\ e^{-\lambda} \sum_{m = 0}^{k} \frac{\lambda^m}{m!} & k \le x < k + 1 \text{ with } 0 \le k. \end{cases}$$ (A.14)


Let’s first plot the probability mass function (A.13). 


lam = 5.0

xs = [i for i in range(20)]
pmf = torch.tensor([torch.exp(torch.tensor(-lam)) * lam**k
                    / factorial(k) for k in xs])

d2l.plt.stem(xs, pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()


Now, let’s plot the cumulative distribution function (A.14). 


x = torch.arange(-1, 21, 0.01)
cmf = torch.cumsum(pmf, dim=0)

def F(x):
    # pmf is tabulated only for k = 0, ..., 19, so clamp the index
    return 0 if x < 0 else cmf[min(int(x), len(cmf) - 1)]

d2l.plot(x, torch.tensor([F(y) for y in x.tolist()]), 'x', 'c.d.f.')


As we saw above, the means and variances are particularly concise. If X ∼ Poisson(λ),
then:

• μ_X = λ,

• σ_X^2 = λ.


This can be sampled as follows. 


m = torch.distributions.poisson.Poisson(lam) 
m.sample((10, 10)) 


[Output: a 10 × 10 tensor of non-negative counts; the individual entries vary from run to run.]


A.8.6 Gaussian 


Now let’s try a different, but related experiment. Let’s say we again are performing n
independent Bernoulli(p) measurements X_i. The distribution of the sum of these is X^{(n)} ∼
Binomial(n, p). Rather than taking a limit as n increases and p decreases, let’s fix p, and
then send n → ∞. In this case μ_{X^{(n)}} = np → ∞ and σ^2_{X^{(n)}} = np(1 − p) → ∞, so there is
no reason to think this limit should be well defined.


However, not all hope is lost! Let’s just make the mean and variance be well behaved by
defining

$$Y^{(n)} = \frac{X^{(n)} - \mu_{X^{(n)}}}{\sigma_{X^{(n)}}}.$$ (A.15)


This can be seen to have mean zero and variance one, and so it is plausible to believe that 
it will converge to some limiting distribution. If we plot what these distributions look like, 
we will become even more convinced that it will work.

p = 0.2
ns = [1, 10, 100, 1000]
d2l.plt.figure(figsize=(10, 3))
for i in range(4):
    n = ns[i]
    pmf = torch.tensor([p**k * (1-p)**(n-k) * binom(n, k)
                        for k in range(n + 1)])
    d2l.plt.subplot(1, 4, i + 1)
    d2l.plt.stem([(k - n*p)/torch.sqrt(torch.tensor(n*p*(1 - p)))
                  for k in range(n + 1)], pmf,
                 use_line_collection=True)
    d2l.plt.xlim([-4, 4])
    d2l.plt.xlabel('x')
    d2l.plt.ylabel('p.m.f.')
    d2l.plt.title("n = {}".format(n))
d2l.plt.show()


[Figure: four stem plots of the standardized binomial p.m.f., titled n = 1, n = 10, n = 100, and n = 1000; horizontal axis x from -4 to 4, vertical axis p.m.f.]


One thing to note: compared to the Poisson case, we are now dividing by the standard de- 
viation which means that we are squeezing the possible outcomes into smaller and smaller 
areas. This is an indication that our limit will no longer be discrete, but rather continu- 
ous. 


A derivation of what occurs is beyond the scope of this document, but the central limit
theorem states that as $n \rightarrow \infty$, this will yield the Gaussian distribution (or
sometimes normal distribution). More explicitly, for any $a, b$:

$$\lim_{n \rightarrow \infty} P(Y^{(n)} \in [a, b]) = P(\mathcal{N}(0, 1) \in [a, b]), \tag{A.16}$$

where we say a random variable is normally distributed with given mean $\mu$ and variance
$\sigma^2$, written $X \sim \mathcal{N}(\mu, \sigma^2)$, if $X$ has density

$$p_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \tag{A.17}$$


Let’s first plot the probability density function (A.17).

mu, sigma = 0, 1

x = torch.arange(-3, 3, 0.01)
p = 1 / torch.sqrt(2 * torch.pi * sigma**2) * torch.exp(
    -(x - mu)**2 / (2 * sigma**2))

d2l.plot(x, p, 'x', 'p.d.f.')

[Figure: the Gaussian p.d.f. for mean 0 and standard deviation 1, plotted on the interval from -3 to 3.]


Now, let’s plot the cumulative distribution function. It is beyond the scope of this ap- 
pendix, but the Gaussian c.d.f. does not have a closed-form formula in terms of more 
elementary functions. We will use erf which provides a way to compute this integral nu- 
merically. 


def phi(x):
    return (1.0 + erf((x - mu) / (sigma * torch.sqrt(torch.tensor(2.))))) / 2.0

d2l.plot(x, torch.tensor([phi(y) for y in x.tolist()]), 'x', 'c.d.f.')


Keen-eyed readers will recognize some of these terms. Indeed, we encountered this integral
in Section A.5, and we need exactly that computation to see that $p_X(x)$ has total
area one and is thus a valid density.


Our choice of working with coin flips made computations shorter, but nothing about that
choice was fundamental. Indeed, if we take any collection of independent identically
distributed random variables $X_i$, and form

$$X^{(N)} = \sum_{i=1}^N X_i, \tag{A.18}$$

then

$$\frac{X^{(N)} - \mu_{X^{(N)}}}{\sigma_{X^{(N)}}} \tag{A.19}$$

will be approximately Gaussian. There are additional requirements needed to make it work,
most commonly $E[X^4] < \infty$, but the philosophy is clear.


The central limit theorem is the reason why the Gaussian is fundamental to probability,
statistics, and machine learning. Whenever we can say that something we measured is a
sum of many small independent contributions, we can assume that the thing being measured
will be close to Gaussian.
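To see this claim at work on something other than coin flips, here is a minimal sketch that standardizes sums of uniform random variables as in (A.19) and compares the probability of landing in $[-1, 1]$ with the Gaussian value. The choices of 30 terms per sum, 20,000 trials, and the interval itself are illustrative assumptions, not part of the text above.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

N = 30          # number of i.i.d. terms in each sum (illustrative choice)
trials = 20000  # number of standardized sums to draw

mu_sum = N * 0.5                 # mean of a sum of N Uniform(0, 1) variables
sigma_sum = math.sqrt(N / 12.0)  # standard deviation of that sum

def standardized_sum():
    """Draw X^(N) as a sum of N uniforms and standardize it as in (A.19)."""
    s = sum(random.random() for _ in range(N))
    return (s - mu_sum) / sigma_sum

a, b = -1.0, 1.0
hits = sum(a <= standardized_sum() <= b for _ in range(trials))
empirical = hits / trials

# P(N(0, 1) in [a, b]) from the Gaussian c.d.f. expressed with erf
gaussian = 0.5 * (math.erf(b / math.sqrt(2)) - math.erf(a / math.sqrt(2)))

print(f"empirical: {empirical:.4f}, Gaussian: {gaussian:.4f}")
```

The two numbers should agree to within Monte Carlo noise, even though a single uniform variable looks nothing like a bell curve.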


There are many more fascinating properties of Gaussians, and we would like to discuss one 
more here. The Gaussian is what is known as a maximum entropy distribution. We will 
get into entropy more deeply in Section A.11, however all we need to know at this point 
is that it is a measure of randomness. In a rigorous mathematical sense, we can think of 
the Gaussian as the most random choice of random variable with fixed mean and variance. 
Thus, if we know that our random variable has some mean and variance, the Gaussian is in 
a sense the most conservative choice of distribution we can make. 
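As a small sanity check of the maximum entropy property, the sketch below compares the differential entropy (in nats) of three distributions that share the same variance. It relies on the standard closed-form entropies of the Gaussian, uniform, and Laplace distributions; the choice of these two competitors and of unit variance is an assumption made purely for illustration.

```python
import math

sigma = 1.0  # common standard deviation for all three distributions

# Gaussian with variance sigma^2
h_gaussian = 0.5 * math.log(2 * math.pi * math.e * sigma**2)

# Uniform on an interval of this width has variance width^2 / 12 = sigma^2
width = sigma * math.sqrt(12)
h_uniform = math.log(width)

# Laplace with scale b has variance 2 * b^2 = sigma^2
b = sigma / math.sqrt(2)
h_laplace = 1 + math.log(2 * b)

print(f"Gaussian: {h_gaussian:.4f}, Laplace: {h_laplace:.4f}, uniform: {h_uniform:.4f}")
```

Among these three, the Gaussian has the largest entropy, as the maximum entropy property requires.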


To close the section, let’s recall that if $X \sim \mathcal{N}(\mu, \sigma^2)$, then:

• $\mu_X = \mu$,
• $\sigma_X^2 = \sigma^2$.


We can sample from the Gaussian (or standard normal) distribution as shown below. 


torch.normal(mu, sigma, size=(10, 10)) 


(Output: a 10 × 10 tensor of real-valued samples drawn from a Gaussian with mean mu and standard deviation sigma; the values differ from run to run.)


A.8.7 Exponential Family 


One shared property for all the distributions listed above is that they all belong to what
is known as the exponential family. The exponential family is a set of distributions whose
density can be expressed in the following form:

$$p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x}) \cdot \exp \left( \boldsymbol{\eta}^{\top} \cdot T(\mathbf{x}) - A(\boldsymbol{\eta}) \right). \tag{A.20}$$

As this definition can be a little subtle, let’s examine it closely.

First, $h(\mathbf{x})$ is known as the underlying measure or the base measure. This can be viewed
as an original choice of measure we are modifying with our exponential weight.

Second, we have the vector $\boldsymbol{\eta} = (\eta_1, \eta_2, \ldots, \eta_l) \in \mathbb{R}^l$ called the
natural parameters or canonical parameters. These define how the base measure will be modified.
The natural parameters enter into the new measure by taking the dot product of these parameters
against some function $T(\cdot)$ of $\mathbf{x} = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$ and exponentiating.
The vector $T(\mathbf{x}) = (T_1(\mathbf{x}), T_2(\mathbf{x}), \ldots, T_l(\mathbf{x}))$ is called the sufficient
statistics for $\boldsymbol{\eta}$. This name is used since the information represented by $T(\mathbf{x})$
is sufficient to calculate the probability density and no other information from the sample
$\mathbf{x}$ is required.

Third, we have $A(\boldsymbol{\eta})$, which is referred to as the cumulant function, which ensures that the
above distribution (A.20) integrates to one, i.e.,

$$A(\boldsymbol{\eta}) = \log \left[ \int h(\mathbf{x}) \cdot \exp \left( \boldsymbol{\eta}^{\top} \cdot T(\mathbf{x}) \right) d\mathbf{x} \right]. \tag{A.21}$$


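To be concrete, let’s consider the Gaussian. Assuming that $x$ is a univariate variable, we
saw that it had a density of

$$\begin{aligned}
p(x \mid \mu, \sigma) &= \frac{1}{\sqrt{2 \pi \sigma^2}} \cdot \exp \left\{ \frac{-(x-\mu)^2}{2 \sigma^2} \right\} \\
&= \frac{1}{\sqrt{2 \pi}} \cdot \exp \left\{ \frac{\mu}{\sigma^2} x - \frac{1}{2 \sigma^2} x^2 - \left( \frac{1}{2 \sigma^2} \mu^2 + \log(\sigma) \right) \right\}.
\end{aligned} \tag{A.22}$$

This matches the definition of the exponential family with:

• underlying measure: $h(x) = \frac{1}{\sqrt{2 \pi}}$,
• natural parameters: $\boldsymbol{\eta} = \begin{bmatrix} \eta_1 \\ \eta_2 \end{bmatrix} = \begin{bmatrix} \frac{\mu}{\sigma^2} \\ \frac{1}{2 \sigma^2} \end{bmatrix}$,
• sufficient statistics: $T(x) = \begin{bmatrix} x \\ -x^2 \end{bmatrix}$, and
• cumulant function: $A(\boldsymbol{\eta}) = \frac{1}{2 \sigma^2} \mu^2 + \log(\sigma) = \frac{\eta_1^2}{4 \eta_2} - \frac{1}{2} \log(2 \eta_2)$.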


It is worth noting that the exact choice of each of the above terms is somewhat arbitrary. Indeed,
the important feature is that the distribution can be expressed in this form, not the exact form
itself.
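The Gaussian decomposition can be verified numerically. The sketch below evaluates the density once from the usual formula (A.17) and once from the exponential family form (A.20), using the base measure, natural parameters, sufficient statistics, and cumulant function listed above. The function names are hypothetical and the test points are arbitrary.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian density written in the usual form (A.17)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

def gaussian_expfam(x, mu, sigma):
    """The same density written as h(x) * exp(eta . T(x) - A(eta))."""
    eta = (mu / sigma**2, 1 / (2 * sigma**2))                    # natural parameters
    T = (x, -x**2)                                               # sufficient statistics
    A = eta[0] ** 2 / (4 * eta[1]) - 0.5 * math.log(2 * eta[1])  # cumulant function
    h = 1 / math.sqrt(2 * math.pi)                               # base measure
    return h * math.exp(eta[0] * T[0] + eta[1] * T[1] - A)

for x, mu, sigma in [(0.0, 0.0, 1.0), (1.5, -0.3, 2.0), (-2.0, 1.0, 0.5)]:
    print(gaussian_pdf(x, mu, sigma), gaussian_expfam(x, mu, sigma))
```

Each printed pair should match up to floating-point rounding.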


As we allude to in Section 4.1.2, a widely used technique is to assume that the final output 
y follows an exponential family distribution. The exponential family is a common and 
powerful family of distributions encountered frequently in machine learning. 


A.8.8 Summary 


• Bernoulli random variables can be used to model events with a yes/no outcome.
• Discrete uniform distributions model selections from a finite set of possibilities.
• Continuous uniform distributions select from an interval.
• Binomial distributions model a series of Bernoulli random variables, and count the number
  of successes.
• Poisson random variables model the arrival of rare events.
• Gaussian random variables model the result of adding a large number of independent
  random variables together.
• All the above distributions belong to the exponential family.


A.8.9 Exercises 


1. What is the standard deviation of a random variable that is the difference $X - Y$ of two
independent binomial random variables $X, Y \sim \textrm{Binomial}(16, 1/2)$?


2. If we take a Poisson random variable $X \sim \textrm{Poisson}(\lambda)$ and consider
$(X - \lambda)/\sqrt{\lambda}$ as $\lambda \rightarrow \infty$, we can show that this becomes
approximately Gaussian. Why does this make sense?


3. What is the probability mass function for a sum of two discrete uniform random variables 
on n elements? 




A.9 Naive Bayes 


Throughout the previous sections, we learned about the theory of probability and random
variables. To put this theory to work, let’s introduce the naive Bayes classifier. This
uses nothing but probabilistic fundamentals to allow us to perform classification of digits.


Learning is all about making assumptions. If we want to classify a new data example that we 
have never seen before we have to make some assumptions about which data examples are 
similar to each other. The naive Bayes classifier, a popular and remarkably clear algorithm, 
assumes all features are independent from each other to simplify the computation. In this 
section, we will apply this model to recognize characters in images. 


%matplotlib inline
import math
import torch
import torchvision
from d2l import torch as d2l

d2l.use_svg_display()


A.9.1 Optical Character Recognition 


MNIST (LeCun et al., 1998) is one of the most widely used datasets. It contains 60,000 images for
training and 10,000 images for validation. Each image contains a handwritten digit from 0
to 9. The task is classifying each image into the corresponding digit.


The torchvision package provides an MNIST class in its datasets module to automatically retrieve
the dataset from the Internet. Subsequently, it will use the already-downloaded local copy.
We specify whether we are requesting the training set or the test set by setting the value of
the parameter train to True or False, respectively. Each image is a grayscale image with
both width and height of 28; after the ToTensor transformation it has shape (1, 28, 28). We use
a customized transformation to remove the leading channel dimension. In addition, the dataset
represents each pixel by an unsigned 8-bit integer. We quantize them into binary features to
simplify the problem.


data_transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    lambda x: torch.floor(x * 255 / 128).squeeze(dim=0)
])

mnist_train = torchvision.datasets.MNIST(
    root='./temp', train=True, transform=data_transform, download=True)
mnist_test = torchvision.datasets.MNIST(
    root='./temp', train=False, transform=data_transform, download=True)


Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./temp/MNIST/raw/train-images-idx3-ubyte.gz
100%| | 9912422/9912422 [00:00<00:00, 115752065.81it/s]
Extracting ./temp/MNIST/raw/train-images-idx3-ubyte.gz to ./temp/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./temp/MNIST/raw/train-labels-idx1-ubyte.gz
100%| | 28881/28881 [00:00<00:00, 5234904.66it/s]
Extracting ./temp/MNIST/raw/train-labels-idx1-ubyte.gz to ./temp/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./temp/MNIST/raw/t10k-images-idx3-ubyte.gz
100%| | 1648877/1648877 [00:00<00:00, 43715298.68it/s]
Extracting ./temp/MNIST/raw/t10k-images-idx3-ubyte.gz to ./temp/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./temp/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%| | 4542/4542 [00:00<00:00, 21501725.47it/s]
Extracting ./temp/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./temp/MNIST/raw


We can access a particular example, which contains the image and the corresponding label.


image, label = mnist_train[2] 
image.shape, label 


(torch.Size([28, 28]), 4) 


Our example, stored here in the variable image, corresponds to an image with a height and 
width of 28 pixels. 


image.shape, image.dtype 


(torch.Size([28, 28]), torch.float32)


Our code stores the label of each image as a scalar. Its type is a Python int.


label, type(label) 


(4, int) 


We can also access multiple examples at the same time. 


images = torch.stack([mnist_train[i][0] for i in range(10, 38)], dim=0)
labels = torch.tensor([mnist_train[i][1] for i in range(10, 38)])
images.shape, labels.shape


(torch.Size([28, 28, 28]), torch.Size([28]))


Let’s visualize these examples. 


d2l.show_images(images, 2, 9);


A.9.2 The Probabilistic Model for Classification 


In a classification task, we map an example into a category. Here an example is a grayscale
28 × 28 image, and a category is a digit. (Refer to Section 4.1 for a more detailed
explanation.) One natural way to express the classification task is via the probabilistic question:
what is the most likely label given the features (i.e., image pixels)? Denote by
$\mathbf{x} \in \mathbb{R}^d$ the features of the example and $y \in \mathbb{R}$ the label. Here features
are image pixels, where we can reshape a 2-dimensional image to a vector so that
$d = 28^2 = 784$, and labels are digits. The probability of the label given the features is
$p(y \mid \mathbf{x})$. If we are able to compute these probabilities, which are $p(y \mid \mathbf{x})$
for $y = 0, \ldots, 9$ in our example, then the classifier will output the prediction $\hat{y}$
given by the expression:

$$\hat{y} = \operatorname*{argmax}_y \, p(y \mid \mathbf{x}). \tag{A.1}$$

Unfortunately, this requires that we estimate $p(y \mid \mathbf{x})$ for every value of
$\mathbf{x} = x_1, \ldots, x_d$. Imagine that each feature could take one of 2 values. For example,
the feature $x_1 = 1$ might signify that the word apple appears in a given document and
$x_1 = 0$ would signify that it does not. If we had 30 such binary features, that would mean
that we need to be prepared to classify any of $2^{30}$ (over 1 billion!) possible values of
the input vector $\mathbf{x}$.


Moreover, where is the learning? If we need to see every single possible example in order to 
predict the corresponding label then we are not really learning a pattern but just memorizing 
the dataset. 


A.9.3 The Naive Bayes Classifier 


Fortunately, by making some assumptions about conditional independence, we can introduce
some inductive bias and build a model capable of generalizing from a comparatively
modest selection of training examples. To begin, let’s use Bayes’ theorem to express the
classifier as

$$\hat{y} = \operatorname*{argmax}_y \, p(y \mid \mathbf{x}) = \operatorname*{argmax}_y \, \frac{p(\mathbf{x} \mid y) p(y)}{p(\mathbf{x})}. \tag{A.2}$$

Note that the denominator is the normalizing term $p(\mathbf{x})$ which does not depend on the value
of the label $y$. As a result, we only need to worry about comparing the numerator across
different values of $y$. Even if calculating the denominator turned out to be intractable, we
could get away with ignoring it, so long as we could evaluate the numerator. Fortunately,
even if we wanted to recover the normalizing constant, we could. We can always recover
the normalization term since $\sum_y p(y \mid \mathbf{x}) = 1$.


Now, let’s focus on $p(\mathbf{x} \mid y)$. Using the chain rule of probability, we can express the term
$p(\mathbf{x} \mid y)$ as

$$p(x_1 \mid y) \cdot p(x_2 \mid x_1, y) \cdot \ldots \cdot p(x_d \mid x_1, \ldots, x_{d-1}, y). \tag{A.3}$$

By itself, this expression does not get us any further. We still must estimate roughly $2^d$
parameters. However, if we assume that the features are conditionally independent of each
other, given the label, then suddenly we are in much better shape, as this term simplifies to
$\prod_i p(x_i \mid y)$, giving us the predictor

$$\hat{y} = \operatorname*{argmax}_y \, \prod_{i=1}^d p(x_i \mid y) p(y). \tag{A.4}$$

If we can estimate $p(x_i = 1 \mid y)$ for every $i$ and $y$, and save its value in $P_{xy}[i, y]$, where
$P_{xy}$ is a $d \times n$ matrix with $n$ being the number of classes and $y \in \{1, \ldots, n\}$, then
we can also use this to estimate $p(x_i = 0 \mid y)$, i.e.,

$$p(x_i = t_i \mid y) = \begin{cases} P_{xy}[i, y] & \textrm{for } t_i = 1; \\ 1 - P_{xy}[i, y] & \textrm{for } t_i = 0. \end{cases} \tag{A.5}$$

In addition, we estimate $p(y)$ for every $y$ and save it in $P_y[y]$, with $P_y$ an $n$-length vector.
Then, for any new example $\mathbf{t} = (t_1, t_2, \ldots, t_d)$, we could compute

$$\begin{aligned}
\hat{y} &= \operatorname*{argmax}_y \, p(y) \prod_{i=1}^d p(x_i = t_i \mid y) \\
&= \operatorname*{argmax}_y \, P_y[y] \prod_{i=1}^d P_{xy}[i, y]^{t_i} \left(1 - P_{xy}[i, y]\right)^{1 - t_i}
\end{aligned} \tag{A.6}$$

for any $y$. So our assumption of conditional independence has taken the complexity of
our model from an exponential dependence on the number of features $\mathcal{O}(2^d n)$ to a linear
dependence, which is $\mathcal{O}(dn)$.
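To make (A.6) concrete before moving to images, here is a minimal sketch of the predictor for a toy problem with three binary features and two classes. All the numbers in `P_y` and `P_xy` are hypothetical values chosen for illustration, and classes are indexed from 0 here for convenience.

```python
P_y = [0.5, 0.5]     # hypothetical class priors P_y[y]
P_xy = [[0.9, 0.2],  # hypothetical P_xy[i][y] = p(x_i = 1 | y)
        [0.1, 0.7],
        [0.8, 0.3]]

def naive_bayes_scores(t):
    """Return p(y) * prod_i p(x_i = t_i | y) for every class y, as in (A.6)."""
    scores = []
    for y in range(len(P_y)):
        score = P_y[y]
        for i, t_i in enumerate(t):
            # The exponents select P_xy[i][y] when t_i = 1 and 1 - P_xy[i][y] when t_i = 0
            score *= P_xy[i][y] ** t_i * (1 - P_xy[i][y]) ** (1 - t_i)
        scores.append(score)
    return scores

def predict(t):
    """Return the index of the class with the largest score."""
    scores = naive_bayes_scores(t)
    return max(range(len(scores)), key=scores.__getitem__)

print(naive_bayes_scores((1, 0, 1)), predict((1, 0, 1)))
print(naive_bayes_scores((0, 1, 0)), predict((0, 1, 0)))
```

The model stores only $dn + n$ numbers (here $3 \cdot 2 + 2 = 8$), rather than one probability per possible input vector and class.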


A.9.4 Training 


The problem now is that we do not know $P_{xy}$ and $P_y$. So we need to estimate their values
given some training data first. This is training the model. Estimating $P_y$ is not too hard.
Since we are only dealing with 10 classes, we may count the number of occurrences $n_y$ for
each of the digits and divide it by the total amount of data $n$. For instance, if digit 8 occurs
$n_8 = 5,800$ times and we have a total of $n = 60,000$ images, the probability estimate is
$p(y = 8) = 0.0967$.

X = torch.stack([mnist_train[i][0] for i in range(len(mnist_train))], dim=0)
Y = torch.tensor([mnist_train[i][1] for i in range(len(mnist_train))])

n_y = torch.zeros(10)
for y in range(10):
    n_y[y] = (Y == y).sum()
P_y = n_y / n_y.sum()
P_y


tensor([0.0987, 0.1124, 0.0993, 0.1022, 0.0974, 0.0904, 0.0986, 0.1044, 0.0975,
        0.0992])


Now on to slightly more difficult things: $P_{xy}$. Since we picked black and white images,
$p(x_i \mid y)$ denotes the probability that pixel $i$ is switched on for class $y$. Just like before we
can go and count the number of times $n_{iy}$ such that an event occurs and divide it by the
total number of occurrences of $y$, i.e., $n_y$. But there is something slightly troubling: certain
pixels may never be black (e.g., for well cropped images the corner pixels might always be
white). A convenient way for statisticians to deal with this problem is to add pseudo counts
to all occurrences. Hence, rather than $n_{iy}$ we use $n_{iy} + 1$ and instead of $n_y$ we use $n_y + 2$
(since there are two possible values pixel $i$ can take: it can either be black or white). This
is also called Laplace smoothing. It may seem ad hoc, however it can be motivated from a
Bayesian point of view by a Beta-binomial model.
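The effect of the pseudo counts is easiest to see on a single pixel. The sketch below compares the raw frequency $n_{iy} / n_y$ with the smoothed estimate $(n_{iy} + 1) / (n_y + 2)$; the function name and the counts are hypothetical.

```python
def laplace_estimate(n_iy, n_y):
    """Smoothed estimate of p(x_i = 1 | y): (n_iy + 1) / (n_y + 2)."""
    return (n_iy + 1) / (n_y + 2)

n_y = 50  # hypothetical number of training images of class y
for n_iy in (0, 25, 50):  # hypothetical counts of images in which pixel i is on
    raw = n_iy / n_y
    smoothed = laplace_estimate(n_iy, n_y)
    print(f"n_iy={n_iy}: raw={raw:.4f}, smoothed={smoothed:.4f}")
```

A pixel that is never on (or always on) in training no longer receives probability exactly 0 (or 1), so a single unexpected pixel at test time cannot force the whole product in (A.6) to zero.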


n_x = torch.zeros((10, 28, 28))
for y in range(10):
    n_x[y] = torch.tensor(X.numpy()[Y.numpy() == y].sum(axis=0))
P_xy = (n_x + 1) / (n_y + 2).reshape(10, 1, 1)

d2l.show_images(P_xy, 2, 5);


By visualizing these 10 × 28 × 28 probabilities (for each pixel for each class) we can get
some mean-looking digits, i.e., the average appearance of each digit.


Now we can use (A.6) to predict a new image. Given $\mathbf{x}$, the following function computes
$p(\mathbf{x} \mid y) p(y)$ for every $y$.

def bayes_pred(x):
    x = x.unsqueeze(0)  # (28, 28) -> (1, 28, 28)
    p_xy = P_xy * x + (1 - P_xy)*(1 - x)
    p_xy = p_xy.reshape(10, -1).prod(dim=1)  # p(x|y)
    return p_xy * P_y

image, label = mnist_test[0]
bayes_pred(image)


tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])


This went horribly wrong! To find out why, let’s look at the per pixel probabilities. They
are typically numbers between 0.001 and 1. We are multiplying 784 of them. At this point
it is worth mentioning that we are calculating these numbers on a computer, hence with a
fixed range for the exponent. What happens is that we experience numerical underflow, i.e.,
multiplying all the small numbers leads to something even smaller until it is rounded down
to zero. We discussed this as a theoretical issue in Section A.7, but we see the phenomenon
clearly here in practice.


As discussed in that section, we fix this by using the fact that $\log ab = \log a + \log b$, i.e.,
we switch to summing logarithms. Even if both $a$ and $b$ are small numbers, the logarithm
values should be in a proper range.


a = 0.1
print('underflow:', a**784)
print('logarithm is normal:', 784*math.log(a))


underflow: 0.0
logarithm is normal: -1805.2267129073316


Since the logarithm is an increasing function, we can rewrite (A.6) as

$$\hat{y} = \operatorname*{argmax}_y \, \log P_y[y] + \sum_{i=1}^d \Big[ t_i \log P_{xy}[i, y] + (1 - t_i) \log \left(1 - P_{xy}[i, y]\right) \Big]. \tag{A.7}$$


We can implement the following stable version: 


log_P_xy = torch.log(P_xy) 
log_P_xy_neg = torch.log(1 - P_xy) 
log_P_y = torch.log(P_y) 


def bayes_pred_stable(x):
    x = x.unsqueeze(0)  # (28, 28) -> (1, 28, 28)
    p_xy = log_P_xy * x + log_P_xy_neg * (1 - x)
    p_xy = p_xy.reshape(10, -1).sum(axis=1)  # log p(x|y)
    return p_xy + log_P_y

py = bayes_pred_stable(image)
py


tensor([-268.9725, -301.7044, -245.1951, -218.8738, -193.4570, -206.0909,
        -292.5226, -114.6257, -220.3313, -163.1784])


We may now check if the prediction is correct.


py.argmax(dim=0) == label


tensor(True)
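Earlier we noted that the normalization term can always be recovered because the posterior probabilities sum to one. As a minimal sketch, the log scores printed above can be turned into posterior probabilities $p(y \mid \mathbf{x})$ with the log-sum-exp trick, which subtracts the largest score before exponentiating so that nothing underflows. The variable names below are illustrative.

```python
import math

# Log scores log p(x | y) + log p(y) for y = 0, ..., 9, copied from the output above
log_joint = [-268.9725, -301.7044, -245.1951, -218.8738, -193.4570,
             -206.0909, -292.5226, -114.6257, -220.3313, -163.1784]

m = max(log_joint)  # subtracting the maximum keeps every exponent at or below zero
log_norm = m + math.log(sum(math.exp(s - m) for s in log_joint))  # log p(x)
posterior = [math.exp(s - log_norm) for s in log_joint]           # p(y | x)

best = max(range(len(posterior)), key=posterior.__getitem__)
print(best, sum(posterior))
```

Because the gap between the best and the second-best log score is large, almost all the posterior mass falls on a single digit. The independence assumption tends to make naive Bayes overconfident in this way.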


If we now predict a few validation examples, we can see that the naive Bayes classifier works
pretty well.


def predict(X):
    return [bayes_pred_stable(x).argmax(dim=0).type(torch.int32).item()
            for x in X]

X = torch.stack([mnist_test[i][0] for i in range(18)], dim=0)
y = torch.tensor([mnist_test[i][1] for i in range(18)])
preds = predict(X)
d2l.show_images(X, 2, 9, titles=[str(d) for d in preds]);


[Figure: the 18 test images in a 2 × 9 grid, each titled with its predicted digit.]


Finally, let’s compute the overall accuracy of the classifier. 


X = torch.stack([mnist_test[i][0] for i in range(len(mnist_test))], dim=0)
y = torch.tensor([mnist_test[i][1] for i in range(len(mnist_test))])
preds = torch.tensor(predict(X), dtype=torch.int32)
float((preds == y).sum()) / len(y)  # Validation accuracy


0.8427 


Modern deep networks achieve error rates of less than 0.01, whereas our classifier has an error
rate of roughly 0.16. The relatively poor performance is due to the incorrect statistical
assumptions that we made in our model: we assumed that each and every pixel is independently
generated, depending only on the label. This is clearly not how humans write digits, and this
wrong assumption led to the downfall of our overly naive (Bayes) classifier.


A.9.5 Summary

• Using Bayes’ rule, a classifier can be made by assuming all observed features are
  conditionally independent given the label.
• This classifier can be trained on a dataset by counting the number of occurrences of
  combinations of labels and pixel values.
• This classifier was the gold standard for decades for tasks such as spam detection.


A.9.6 Exercises 


1. Consider the dataset [[0, 0], [0, 1], [1, 0], [1, 1]] with labels given by the XOR of the
two elements [0, 1, 1, 0]. What are the probabilities for a naive Bayes classifier built
on this dataset? Does it successfully classify our points? If not, what assumptions are
violated?


2. Suppose that we did not use Laplace smoothing when estimating probabilities and a 
data example arrived at testing time which contained a value never observed in training. 
What would the model output? 


3. The naive Bayes classifier is a specific example of a Bayesian network, where the
dependence of random variables is encoded with a graph structure. While the full theory
is beyond the scope of this section (see Koller and Friedman (2009) for full details),
explain why allowing explicit dependence between the two input variables in the XOR
model allows for the creation of a successful classifier.




A.10 Statistics 


Undoubtedly, to be a top deep learning practitioner, the ability to train state-of-the-art,
highly accurate models is crucial. However, it is often unclear when improvements are
significant, or only the result of random fluctuations in the training process. To be able to
discuss uncertainty in estimated values, we must learn some statistics.


The earliest reference to statistics can be traced back to the Arab scholar Al-Kindi in the 9th
century, who gave a detailed description of how to use statistics and frequency analysis to
decipher encrypted messages. After 800 years, modern statistics arose in Germany
in the 1700s, when researchers focused on demographic and economic data collection
and analysis. Today, statistics is the science that concerns the collection, processing,
analysis, interpretation and visualization of data. What is more, the core theory of statistics
has been widely used in research within academia, industry, and government.


More specifically, statistics can be divided into descriptive statistics and statistical inference.
The former focuses on summarizing and illustrating the features of a collection of observed
data, which is referred to as a sample. The sample is drawn from a population, which denotes
the total set of similar individuals, items, or events of interest in our experiment. In contrast to
descriptive statistics, statistical inference further deduces the characteristics of a population
from the given samples, based on the assumption that the sample distribution can replicate
the population distribution to some degree.


You may wonder: “What is the essential difference between machine learning and statistics?”
Fundamentally speaking, statistics focuses on the inference problem. This type of
problem includes modeling the relationship between variables, such as causal inference,
and testing the statistical significance of model parameters, such as A/B testing. In
contrast, machine learning emphasizes making accurate predictions, without explicitly
programming and understanding each parameter’s functionality.


In this section, we will introduce three types of statistical inference methods: evaluating and
comparing estimators, conducting hypothesis tests, and constructing confidence intervals.
These methods can help us infer the characteristics of a given population, i.e., the true
parameter $\theta$. For brevity, we assume that the true parameter $\theta$ of a given population is a
scalar value. It is straightforward to extend to the case where $\theta$ is a vector or a tensor, thus
we omit it in our discussion.


A.10.1 Evaluating and Comparing Estimators 


In statistics, an estimator is a function of given samples used to estimate the true parameter
$\theta$. We will write $\hat{\theta}_n = f(x_1, \ldots, x_n)$ for the estimate of $\theta$ after observing the samples
$\{x_1, x_2, \ldots, x_n\}$.


We have seen simple examples of estimators before in Section A.7. If you have a
number of samples from a Bernoulli random variable, then the maximum likelihood estimate
for the probability the random variable is one can be obtained by counting the number
of ones observed and dividing by the total number of samples. Similarly, an exercise asked
you to show that the maximum likelihood estimate of the mean of a Gaussian given a number
of samples is given by the average value of all the samples. These estimators will almost
never give the true value of the parameter, but ideally for a large number of samples the
estimate will be close.


As an example, we show below the true density of a Gaussian random variable with mean
zero and variance one, along with a collection of samples from that Gaussian. We constructed
the y coordinate so every point is visible and the relationship to the original density is
clearer.


import torch
from d2l import torch as d2l

torch.pi = torch.acos(torch.zeros(1)) * 2  # Define pi in torch

# Sample datapoints and create y coordinate
epsilon = 0.1
torch.manual_seed(8675309)
xs = torch.randn(size=(300,))

ys = torch.tensor(
    [torch.sum(torch.exp(-(xs[:i] - xs[i])**2 / (2 * epsilon**2))\
               / torch.sqrt(2*torch.pi*epsilon**2)) / len(xs)\
     for i in range(len(xs))])

# Compute true density
xd = torch.arange(torch.min(xs), torch.max(xs), 0.01)
yd = torch.exp(-xd**2/2) / torch.sqrt(2 * torch.pi)

# Plot the results
d2l.plot(xd, yd, 'x', 'density')
d2l.plt.scatter(xs, ys)
d2l.plt.axvline(x=0)
d2l.plt.axvline(x=torch.mean(xs), linestyle='--', color='purple')
d2l.plt.title(f'sample mean: {float(torch.mean(xs).item()):.2f}')
d2l.plt.show()


[Figure: the true Gaussian density with the sampled points scattered beneath it; title "sample mean: 0.00", horizontal axis x, vertical axis density.]


There can be many ways to compute an estimator $\hat{\theta}_n$ of a parameter. In this section, we
introduce three common methods to evaluate and compare estimators: the mean squared
error, the standard deviation, and statistical bias.


Mean Squared Error


Perhaps the simplest metric used to evaluate estimators is the mean squared error (MSE)
(or $l_2$ loss), which can be defined as

$$\textrm{MSE}(\hat{\theta}_n, \theta) = E[(\hat{\theta}_n - \theta)^2]. \tag{A.1}$$

This allows us to quantify the average squared deviation from the true value. MSE is always
non-negative. If you have read Section 3.1, you will recognize it as the most commonly used
regression loss function. As a measure to evaluate an estimator, the closer its value is to zero,
the closer the estimator is to the true parameter $\theta$.


Statistical Bias 


The MSE provides a natural metric, but we can easily imagine multiple different phenomena
that might make it large. Two fundamentally important ones are fluctuation in the estimator
due to randomness in the dataset, and systematic error in the estimator due to the estimation
procedure.


First, let’s measure the systematic error. For an estimator $\hat{\theta}_n$, the statistical bias
can be defined as

$$\textrm{bias}(\hat{\theta}_n) = E(\hat{\theta}_n - \theta) = E(\hat{\theta}_n) - \theta. \tag{A.2}$$

Note that when $\textrm{bias}(\hat{\theta}_n) = 0$, the expectation of the estimator $\hat{\theta}_n$ is equal to the true value
of the parameter. In this case, we say $\hat{\theta}_n$ is an unbiased estimator. In general, an unbiased
estimator is better than a biased estimator since its expectation is the same as the true
parameter.


It is worth being aware, however, that biased estimators are frequently used in practice.
There are cases where unbiased estimators do not exist without further assumptions, or
are intractable to compute. This may seem like a significant flaw in an estimator, however
the majority of estimators encountered in practice are at least asymptotically unbiased in
the sense that the bias tends to zero as the number of available samples tends to infinity:
$\lim_{n \rightarrow \infty} \textrm{bias}(\hat{\theta}_n) = 0$.
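A familiar example of an asymptotically unbiased estimator is the variance estimate that divides by $n$ instead of $n - 1$: its expectation is $\frac{n-1}{n}\sigma^2$, so its bias is $-\sigma^2 / n$, which vanishes as $n$ grows. The following is a minimal Monte Carlo sketch of this fact; the estimator is our choice of illustration rather than one used in the text above, and the sample sizes, number of trials, and seed are arbitrary.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
sigma = 2.0     # true standard deviation of the population
trials = 20000  # number of simulated datasets per sample size

def biased_variance(samples):
    """Variance estimate that divides by n rather than n - 1."""
    n = len(samples)
    mean = sum(samples) / n
    return sum((x - mean) ** 2 for x in samples) / n

def estimated_bias(n):
    """Monte Carlo estimate of E[estimator] - sigma^2 for sample size n."""
    total = 0.0
    for _ in range(trials):
        samples = [random.gauss(0.0, sigma) for _ in range(n)]
        total += biased_variance(samples)
    return total / trials - sigma**2

results = {n: estimated_bias(n) for n in (5, 50)}
for n, bias in results.items():
    print(f"n={n}: estimated bias {bias:.4f}, theoretical bias {-sigma**2 / n:.4f}")
```

The estimated bias should track $-\sigma^2 / n$ and shrink roughly tenfold when the sample size grows from 5 to 50.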


Variance and Standard Deviation 


Second, let’s measure the randomness in the estimator. Recall from Section A.6 that the
standard deviation (or standard error) is defined as the square root of the variance. We may
measure the degree of fluctuation of an estimator by measuring the standard deviation or
variance of that estimator.

$$\sigma_{\hat{\theta}_n} = \sqrt{\textrm{Var}(\hat{\theta}_n)} = \sqrt{E[(\hat{\theta}_n - E(\hat{\theta}_n))^2]}. \tag{A.3}$$

It is important to compare (A.3) to (A.1). In this equation we do not compare to the true
population value $\theta$, but instead to $E(\hat{\theta}_n)$, the expected sample mean. Thus we are not
measuring how far the estimator tends to be from the true value, but instead we are measuring
the fluctuation of the estimator itself.
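To make the fluctuation of an estimator tangible, the sketch below repeatedly draws datasets, computes the sample mean of each, and measures the standard deviation of those estimates as in (A.3). For independent samples with standard deviation $\sigma$, the sample mean has standard deviation $\sigma / \sqrt{n}$, which serves as the reference value. The parameter values, the number of trials, and the seed are illustrative assumptions.

```python
import math
import random
import statistics

random.seed(0)           # fixed seed so the sketch is repeatable
theta, sigma = 1.0, 4.0  # true mean and standard deviation (illustrative values)
n, trials = 100, 5000    # samples per dataset and number of simulated datasets

# Each trial yields one realization of the estimator (the sample mean).
estimates = [
    sum(random.gauss(theta, sigma) for _ in range(n)) / n
    for _ in range(trials)
]

spread = statistics.pstdev(estimates)  # empirical standard deviation of the estimator
theory = sigma / math.sqrt(n)          # standard error of the mean for i.i.d. samples

print(f"empirical: {spread:.4f}, theoretical: {theory:.4f}")
```

Note that the true value `theta` never enters the computation of `spread`: the standard deviation describes how much the estimator varies around its own expectation, not how far it is from the truth.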


The Bias- Variance Trade-off 


It is intuitively clear that these two main components contribute to the mean squared error. 
What is somewhat shocking is that we can show that this is actually a decomposition of 
the mean squared error into these two contributions plus a third one. That is to say that we 
can write the mean squared error as the sum of the square of the bias, the variance and the 
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irreducible error. 


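$$
\begin{aligned}
\mathrm{MSE}(\hat{\theta}_n, \theta) &= E[(\hat{\theta}_n - \theta)^2] \\
&= E[(\hat{\theta}_n)^2] + E[\theta^2] - 2E[\hat{\theta}_n \theta] \\
&= \mathrm{Var}[\hat{\theta}_n] + E[\hat{\theta}_n]^2 + \mathrm{Var}[\theta] + E[\theta]^2 - 2E[\hat{\theta}_n]E[\theta] \\
&= (E[\hat{\theta}_n] - E[\theta])^2 + \mathrm{Var}[\hat{\theta}_n] + \mathrm{Var}[\theta] \\
&= (E[\hat{\theta}_n - \theta])^2 + \mathrm{Var}[\hat{\theta}_n] + \mathrm{Var}[\theta] \\
&= (\mathrm{bias}[\hat{\theta}_n])^2 + \mathrm{Var}(\hat{\theta}_n) + \mathrm{Var}[\theta].
\end{aligned} \tag{A.4}
$$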


We refer to the above formula as the bias-variance trade-off. The mean squared error can be
divided into three sources of error: the error from high bias, the error from high variance, and
the irreducible error. The bias error is commonly seen in a simple model (such as a linear
regression model), which cannot extract high dimensional relations between the features
and the outputs. If a model suffers from high bias error, we often say it is underfitting or
lacks flexibility, as introduced in Section 3.6. High variance usually results from a
model that is too complex and overfits the training data. As a result, an overfitting model is
sensitive to small fluctuations in the data. If a model suffers from high variance, we often
say it is overfitting and lacks generalization, as introduced in Section 3.6. The irreducible
error is the result of noise in $\theta$ itself.


Evaluating Estimators in Code 


Since the standard deviation of an estimator can be computed by simply calling a.std() for a
tensor a, we will skip it and implement only the statistical bias and the mean squared
error.


# Statistical bias
def stat_bias(true_theta, est_theta):
    return torch.mean(est_theta) - true_theta

# Mean squared error
def mse(data, true_theta):
    return torch.mean(torch.square(data - true_theta))


To illustrate the equation of the bias-variance trade-off, let’s simulate a normal distribution
$\mathcal{N}(\theta, \sigma^2)$ with 10,000 samples. Here, we use $\theta = 1$ and $\sigma = 4$. As the estimator is a
function of the given samples, here we use the mean of the samples as an estimator for the true
$\theta$ in this normal distribution $\mathcal{N}(\theta, \sigma^2)$.


theta_true = 1
sigma = 4
sample_len = 10000
samples = torch.normal(theta_true, sigma, size=(sample_len, 1))
theta_est = torch.mean(samples)
theta_est


tensor(1.0170)


Let’s validate the trade-off equation by computing the sum of the squared bias and
the variance of our estimator. First, calculate the MSE of our estimator.


mse(samples, theta_true) 


tensor(16.0298)


Next, we calculate $\mathrm{Var}(\hat{\theta}_n) + [\mathrm{bias}(\hat{\theta}_n)]^2$ as below. As you can see, the two values agree to
numerical precision.


bias = stat_bias(theta_true, theta_est)
torch.square(samples.std(unbiased=False)) + torch.square(bias)


tensor(16.0298)
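The check above treats the individual samples as the quantity being compared with $\theta$. A
complementary sketch, under assumptions of our own, applies the decomposition to the
estimator itself by repeating the experiment many times. It uses only the Python standard
library; the number of trials, the sample size per trial, and the seed are arbitrary choices
for illustration. Because $\theta$ is a fixed constant here, the $\mathrm{Var}[\theta]$ term vanishes.

```python
import random
from statistics import fmean, pvariance

random.seed(0)  # arbitrary seed, only for reproducibility
theta_true, sigma = 1.0, 4.0
n, trials = 100, 2000  # illustrative sizes, not taken from the text

# One estimate (the sample mean) per simulated dataset.
estimates = [
    fmean(random.gauss(theta_true, sigma) for _ in range(n))
    for _ in range(trials)
]

mse_est = fmean((e - theta_true) ** 2 for e in estimates)
bias_est = fmean(estimates) - theta_true
var_est = pvariance(estimates)

# The first pair agrees up to floating-point rounding;
# the second pair compares the empirical variance with sigma**2 / n.
print(mse_est, bias_est ** 2 + var_est)
print(var_est, sigma ** 2 / n)
```

Here the bias of the sample mean is close to zero, so almost all of the mean squared error
comes from the variance of the estimator.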


A.10.2 Conducting Hypothesis Tests 


The most commonly encountered topic in statistical inference is hypothesis testing. While
hypothesis testing was popularized in the early 20th century, its first use can be traced
back to John Arbuthnot in the 1700s. Arbuthnot tracked 80 years of birth records in London and
concluded that more men were born than women each year. Following that, modern
significance testing is the intellectual heritage of Karl Pearson, who invented the p-value and
Pearson’s chi-squared test, William Gosset, who is the father of Student’s t-distribution, and
Ronald Fisher, who initiated the null hypothesis and the significance test.


A hypothesis test is a way of evaluating some evidence against the default statement about a
population. We refer to the default statement as the null hypothesis $H_0$, which we try to reject
using the observed data. Here, we use $H_0$ as a starting point for the statistical significance
testing. The alternative hypothesis $H_A$ (or $H_1$) is a statement that is contrary to the null
hypothesis. A null hypothesis is often stated in a declarative form which posits a relationship
between variables. It should state the belief as explicitly as possible, and be testable by
statistical theory.


Imagine you are a chemist. After spending thousands of hours in the lab, you develop a 
new medicine which can dramatically improve one’s ability to understand math. To show 
its magic power, you need to test it. Naturally, you may need some volunteers to take 
the medicine and see whether it can help them learn mathematics better. How do you get 
started? 


First, you will need two carefully and randomly selected groups of volunteers, so that there is no
difference between their mathematical understanding ability as measured by some metrics.
The two groups are commonly referred to as the test group and the control group. The
test group (or treatment group) is a group of individuals who will experience the medicine,
while the control group represents the group of users who are set aside as a benchmark,
i.e., with an identical environment setup except for taking this medicine. In this way, the
influence of all the variables is minimized, except for the impact of the independent variable
in the treatment.


Second, after a period of taking the medicine, you will need to measure the two groups’ 
mathematical understanding by the same metrics, such as letting the volunteers do the same 
tests after learning a new mathematical formula. Then, you can collect their performance 
and compare the results. In this case, our null hypothesis will be that there is no difference 
between the two groups, and our alternate will be that there is. 


This is still not fully formal. There are many details you have to think about carefully. For
example, what are suitable metrics to test their mathematical understanding ability? How
many volunteers do you need so that you can confidently claim the effectiveness of your
medicine? How long should you run the test? How do you decide if there is a difference
between the two groups? Do you care about the average performance only, or also the range
of variation of the scores? And so on.


In this way, hypothesis testing provides a framework for experimental design and reasoning 
about certainty in observed results. If we can now show that the null hypothesis is very 
unlikely to be true, we may reject it with confidence. 


To complete the story of how to work with hypothesis testing, we need to now introduce 
some additional terminology and make some of our concepts above formal. 


Statistical Significance 


The statistical significance measures the probability of not erroneously rejecting the null
hypothesis $H_0$ when it should not be rejected, i.e.,


$$\text{statistical significance} = 1 - \alpha = 1 - P(\text{reject } H_0 \mid H_0 \text{ is true}). \tag{A.5}$$


The erroneous rejection itself is referred to as a type I error or false positive. Its
probability $\alpha$ is called the significance level, and its commonly used value is 5%, i.e.,
$1 - \alpha = 95\%$. The significance level can be explained as the level of risk that we are
willing to take when we reject a true null hypothesis.


Fig. A.1 shows the observations’ values and probability of a given normal distribution
in a two-sample hypothesis test. If an observed data example is located outside the
95% threshold, it is a very unlikely observation under the null hypothesis assumption.
Hence, there might be something wrong with the null hypothesis and we will reject it.
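To make the meaning of $\alpha$ concrete, the following sketch simulates many experiments in
which the null hypothesis is actually true and counts how often a two-sided z-test rejects
it. Everything here is an assumption chosen for illustration: the null is that the data come
from a normal distribution with mean 0 and known unit variance, each experiment has 30
observations, and the threshold 1.96 corresponds to $\alpha = 0.05$.

```python
import random
from math import sqrt
from statistics import fmean

random.seed(0)  # arbitrary seed, only for reproducibility
n, trials, z_crit = 30, 4000, 1.96  # illustrative values; 1.96 matches alpha = 0.05

rejections = 0
for _ in range(trials):
    # Data generated under the null hypothesis: mean 0, known standard deviation 1.
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    z = fmean(sample) * sqrt(n)  # z statistic for H0: mean = 0 with sigma = 1
    if abs(z) > z_crit:
        rejections += 1  # a type I error, since H0 is true here

type_one_rate = rejections / trials
print(type_one_rate)
```

The fraction of false rejections should land near 0.05, which is the risk accepted by
choosing that significance level.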


Statistical Power 


The statistical power (or sensitivity) measures the probability of rejecting the null hypothesis
$H_0$ when it should be rejected, i.e.,


$$\text{statistical power} = 1 - \beta = 1 - P(\text{fail to reject } H_0 \mid H_0 \text{ is false}). \tag{A.6}$$


Figure A.1: Statistical significance. Horizontal axis: value of observations; vertical axis:
probability of observations. The central region is the 95% confidence interval, bounded by
the 97.5% significance thresholds ($1 - \alpha/2$); observations in the tails are very unusual,
and an outlier there has a p-value < 5%.


Recall that a type I error is caused by rejecting the null hypothesis when it is true,
whereas a type II error results from failing to reject the null hypothesis when it is false.
The probability of a type II error is usually denoted as $\beta$, and hence the corresponding
statistical power is $1 - \beta$.


Intuitively, statistical power can be interpreted as how likely our test will detect a real dis- 
crepancy of some minimum magnitude at a desired statistical significance level. 80% is 
a commonly used statistical power threshold. The higher the statistical power, the more 
likely we are to detect true differences. 


One of the most common uses of statistical power is in determining the number of samples
needed. The probability you reject the null hypothesis when it is false depends on the degree
to which it is false (known as the effect size) and the number of samples you have. As you
might expect, small effect sizes will require a very large number of samples to be detectable
with high probability. While it is beyond the scope of this brief appendix to derive in detail,
as an example, suppose we want to be able to reject a null hypothesis that our sample came
from a mean zero, variance one Gaussian. If we believe that our sample’s mean is actually
close to one, we can do so with acceptable error rates with a sample size of only 8. However,
if we think our sample population’s true mean is close to 0.01, then we’d need a sample size
of nearly 80000 to detect the difference.
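The sample sizes quoted above can be reproduced with a standard normal-approximation
formula, $n \approx ((z_{1-\alpha/2} + z_{1-\beta}) / \delta)^2$, where $\delta$ is the effect size in units of the
standard deviation. The text does not state which error rates it uses, so this sketch assumes
a two-sided z-test with known unit variance, $\alpha = 0.05$, and power $0.8$; the function name
`required_sample_size` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(effect_size, alpha=0.05, power=0.8):
    """Approximate n for a two-sided z-test with unit variance (assumed setup)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = std_normal.inv_cdf(power)           # quantile matching the desired power
    return ceil(((z_alpha + z_beta) / effect_size) ** 2)

print(required_sample_size(1.0))   # 8
print(required_sample_size(0.01))  # 78489
```

Since the required sample size scales as $1/\delta^2$, shrinking the effect size by a factor of
100 inflates the number of samples by a factor of 10000.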


We can imagine the power as a water filter. In this analogy, a high power hypothesis test is
like a high quality water filtration system that will reduce harmful substances in the water
as much as possible. On the other hand, a low power test is like a low quality water
filter, where some relatively small substances may easily escape through the gaps. Similarly,
if the statistical power is not high enough, then the test may not catch the smaller
discrepancy.


Test Statistic 


A test statistic $T(x)$ is a scalar which summarizes some characteristic of the sample data.
The goal of defining such a statistic is that it should allow us to distinguish between different
distributions and conduct our hypothesis test. Thinking back to our chemist example, if we
wish to show that one population performs better than the other, it could be reasonable to
take the mean as the test statistic. Different choices of test statistic can lead to statistical
tests with drastically different statistical power.


Often, $T(X)$ (the distribution of the test statistic under our null hypothesis) will follow, at
least approximately, a common probability distribution such as a normal distribution.
If we can derive such a distribution explicitly, and
then measure our test statistic on our dataset, we can safely reject the null hypothesis if our
statistic is far outside the range that we would expect. Making this quantitative leads us to
the notion of p-values.


p-value 


The p-value (or the probability value) is the probability that $T(X)$ is at least as extreme as
the observed test statistic $T(x)$ assuming that the null hypothesis is true, i.e.,


$$\text{p-value} = P_{H_0}(T(X) \geq T(x)). \tag{A.7}$$


If the p-value is smaller than or equal to a predefined and fixed statistical significance level
$\alpha$, we may reject the null hypothesis. Otherwise, we will conclude that we lack the
evidence to reject the null hypothesis. For a given population distribution, the region of
rejection will be the interval containing all the points which have a p-value smaller than
the statistical significance level $\alpha$.


One-sided Test and Two-sided Test


Normally there are two kinds of significance test: the one-sided test and the two-sided
test. The one-sided test (or one-tailed test) is applicable when the null hypothesis and the
alternative hypothesis only have one direction. For example, the null hypothesis may state
that the true parameter $\theta$ is less than or equal to a value $c$. The alternative hypothesis
would be that $\theta$ is greater than $c$. That is, the region of rejection is on only one side of the
sampling distribution. In contrast to the one-sided test, the two-sided test (or two-tailed test)
is applicable when the region of rejection is on both sides of the sampling distribution. An
example in this case is a null hypothesis stating that the true parameter $\theta$ is equal to a
value $c$. The alternative hypothesis would be that $\theta$ is not equal to $c$.
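As a sketch of how the p-value in (A.7) differs between the two kinds of test, assume the
test statistic follows a standard normal distribution under the null hypothesis (an
assumption for this example; other null distributions would need a different c.d.f.). The
function name `p_value` and its `two_sided` flag are hypothetical.

```python
from statistics import NormalDist

def p_value(t_obs, two_sided=True):
    """p-value of an observed statistic under a standard normal null (assumed)."""
    if two_sided:
        # Extreme values in either direction count as evidence against H0.
        return 2 * (1 - NormalDist().cdf(abs(t_obs)))
    # One-sided test of H0: theta <= c against theta > c uses only the upper tail.
    return 1 - NormalDist().cdf(t_obs)

alpha = 0.05
t_obs = 1.8  # hypothetical observed statistic
print(p_value(t_obs, two_sided=False) <= alpha)  # one-sided decision
print(p_value(t_obs, two_sided=True) <= alpha)   # two-sided decision
```

With the same observed statistic, the one-sided test rejects at the 5% level while the
two-sided test does not, because the latter splits the rejection region across both tails.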


General Steps of Hypothesis Testing 


After getting familiar with the above concepts, let’s go through the general steps of hypoth- 
esis testing. 


1. State the question and establish a null hypothesis $H_0$.

2. Set the statistical significance level $\alpha$ and a statistical power ($1 - \beta$).

3. Obtain samples through experiments. The number of samples needed will depend on
the statistical power and the expected effect size.

4. Calculate the test statistic and the p-value.

5. Make the decision to keep or reject the null hypothesis based on the p-value and the
statistical significance level $\alpha$.


To conduct a hypothesis test, we start by defining a null hypothesis and a level of risk that
we are willing to take. Then we calculate the test statistic of the sample, taking an extreme
value of the test statistic as evidence against the null hypothesis. If the test statistic falls
within the rejection region, we may reject the null hypothesis in favor of the alternative.
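The five steps can be sketched end to end for the chemist example. All numbers below are
made up for illustration: test scores are simulated from normal distributions, the treatment
is assumed to raise the mean score from 70 to 78, and the groups are large enough that a
normal approximation to the two-sample statistic is reasonable. The sketch uses only the
Python standard library.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, variance

random.seed(0)  # arbitrary seed, only for reproducibility

# Steps 1 and 2: H0 says both groups have the same mean score; fix alpha in advance.
alpha = 0.05

# Step 3: simulated scores (hypothetical means 70 and 78, standard deviation 10).
n = 200
control = [random.gauss(70.0, 10.0) for _ in range(n)]
treatment = [random.gauss(78.0, 10.0) for _ in range(n)]

# Step 4: two-sample z statistic and two-sided p-value (large-sample approximation).
std_err = sqrt(variance(control) / n + variance(treatment) / n)
z = (fmean(treatment) - fmean(control)) / std_err
p = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 5: reject H0 when the p-value does not exceed alpha.
reject_null = p <= alpha
print(z, p, reject_null)
```

In a real study the group size would be chosen beforehand from the desired power and the
expected effect size, rather than fixed at an arbitrary value as it is here.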


Hypothesis testing is applicable in a variety of scenarios such as clinical trials and A/B
testing.


A.10.3 Constructing Confidence Intervals 


When estimating the value of a parameter $\theta$, point estimators like $\hat{\theta}$ are of limited utility
since they contain no notion of uncertainty. Rather, it would be far better if we could
produce an interval that would contain the true parameter $\theta$ with high probability. If you
were interested in such ideas a century ago, then you would have been excited to read
“Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability”
by Jerzy Neyman (Neyman, 1937), who first introduced the concept of confidence interval
in 1937.


To be useful, a confidence interval should be as small as possible for a given degree of 
certainty. Let’s see how to derive it. 


Definition 


Mathematically, a confidence interval for the true parameter $\theta$ is an interval $C_n$ that is
computed from the sample data such that


$$P_{\theta}(C_n \ni \theta) \geq 1 - \alpha, \forall \theta. \tag{A.8}$$


Here $\alpha \in (0, 1)$, and $1 - \alpha$ is called the confidence level or coverage of the interval. This is
the same $\alpha$ as the significance level discussed above.


Note that (A.8) is about the variable $C_n$, not about the fixed $\theta$. To emphasize this, we write
$P_{\theta}(C_n \ni \theta)$ rather than $P_{\theta}(\theta \in C_n)$.


Interpretation 


It is very tempting to interpret a 95% confidence interval as an interval where you can be
95% sure the true parameter lies. Sadly, this is not true. The true parameter is fixed,
and it is the interval that is random. Thus a better interpretation would be to say that if you
generated a large number of confidence intervals by this procedure, 95% of the generated
intervals would contain the true parameter.


This may seem pedantic, but it can have real implications for the interpretation of the results.
In particular, we may satisfy (A.8) by constructing intervals that we are almost certain do
not contain the true value, as long as we only do so rarely enough. We close this section by
providing three tempting but false statements. An in-depth discussion of these points can
be found in Morey et al. (2016).


• Fallacy 1. Narrow confidence intervals mean we can estimate the parameter precisely.

• Fallacy 2. The values inside the confidence interval are more likely to be the true value
than those outside the interval.

• Fallacy 3. The probability that a particular observed 95% confidence interval contains
the true value is 95%.


Suffice it to say, confidence intervals are subtle objects. However, if you keep the
interpretation clear, they can be powerful tools.


A Gaussian Example 


Let’s discuss the most classical example, the confidence interval for the mean of a Gaussian
of unknown mean and variance. Suppose we collect $n$ samples $\{x_i\}_{i=1}^n$ from our Gaussian
$\mathcal{N}(\mu, \sigma^2)$. We can compute estimators for the mean and variance by taking


$$\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n x_i \quad \text{and} \quad \hat{\sigma}^2_n = \frac{1}{n-1}\sum_{i=1}^n (x_i - \hat{\mu}_n)^2. \tag{A.9}$$


If we now consider the random variable


$$T = \frac{\hat{\mu}_n - \mu}{\hat{\sigma}_n / \sqrt{n}}, \tag{A.10}$$


we obtain a random variable following a well-known distribution called the Student’s
t-distribution on $n - 1$ degrees of freedom.


This distribution is very well studied, and it is known, for instance, that as $n \rightarrow \infty$, it is
approximately a standard Gaussian, and thus by looking up values of the Gaussian c.d.f. in
a table, we may conclude that the value of $T$ is in the interval $[-1.96, 1.96]$ at least 95%
of the time. For finite values of $n$, the interval needs to be somewhat larger, but the required
values are well known and precomputed in tables.


Thus, we may conclude that for large $n$,


$$P\left(\frac{\hat{\mu}_n - \mu}{\hat{\sigma}_n / \sqrt{n}} \in [-1.96, 1.96]\right) \geq 0.95. \tag{A.11}$$


Rearranging this by multiplying both sides by $\hat{\sigma}_n / \sqrt{n}$ and then adding $\hat{\mu}_n$, we obtain


$$P\left(\mu \in \left[\hat{\mu}_n - 1.96 \frac{\hat{\sigma}_n}{\sqrt{n}}, \hat{\mu}_n + 1.96 \frac{\hat{\sigma}_n}{\sqrt{n}}\right]\right) \geq 0.95. \tag{A.12}$$


Thus we know that we have found our 95% confidence interval:


$$\left[\hat{\mu}_n - 1.96 \frac{\hat{\sigma}_n}{\sqrt{n}}, \hat{\mu}_n + 1.96 \frac{\hat{\sigma}_n}{\sqrt{n}}\right]. \tag{A.13}$$


It is safe to say that (A.13) is one of the most used formulas in statistics. Let’s close our
discussion of statistics by implementing it. For simplicity, we assume we are in the
asymptotic regime. For small values of N, the correct value of t_star should be used, obtained
either programmatically or from a t-table.


# PyTorch uses Bessel's correction by default, which means the use of ddof=1
# instead of default ddof=0 in numpy. We can use unbiased=False to imitate
# ddof=0.

# Number of samples
N = 1000

# Sample dataset
samples = torch.normal(0, 1, size=(N,))

# Lookup Student's t-distribution c.d.f.
t_star = 1.96

# Construct interval
mu_hat = torch.mean(samples)
sigma_hat = samples.std(unbiased=True)
(mu_hat - t_star*sigma_hat/torch.sqrt(torch.tensor(N, dtype=torch.float32)),\
 mu_hat + t_star*sigma_hat/torch.sqrt(torch.tensor(N, dtype=torch.float32)))


(tensor(-0.0568), tensor(0.0704))
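The frequentist interpretation discussed earlier can be checked by simulation: repeat the
experiment many times, build the interval (A.13) each time, and count how often it contains
the true mean. This sketch uses the Python standard library rather than PyTorch, and the
sample size, number of repetitions, and seed are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(0)  # arbitrary seed, only for reproducibility
true_mu, n, trials, t_star = 0.0, 100, 2000, 1.96  # illustrative values

covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mu, 1.0) for _ in range(n)]
    mu_hat = fmean(sample)
    half_width = t_star * stdev(sample) / sqrt(n)  # stdev applies Bessel's correction
    # The interval is random; the true mean is fixed.
    if mu_hat - half_width <= true_mu <= mu_hat + half_width:
        covered += 1

coverage = covered / trials
print(coverage)
```

The printed coverage should be close to 0.95. Any single interval either contains the true
mean or it does not; the 95% describes the procedure over many repetitions.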


A.10.4 Summary 


• Statistics focuses on inference problems, whereas deep learning emphasizes making
accurate predictions without explicit programming and understanding.

• There are three common statistical inference methods: evaluating and comparing
estimators, conducting hypothesis tests, and constructing confidence intervals.

• There are three common measures for evaluating estimators: statistical bias, standard
deviation, and mean squared error.

• A confidence interval is an estimated range of a true population parameter that we can
construct from the given samples.

• Hypothesis testing is a way of evaluating some evidence against the default statement
about a population.


A.10.5 Exercises 


1. Let $X_1, X_2, \ldots, X_n \overset{\text{iid}}{\sim} \mathrm{Unif}(0, \theta)$, where “iid” stands for independent and identically
distributed. Consider the following estimators of $\theta$:


$$\hat{\theta} = \max\{X_1, X_2, \ldots, X_n\}; \tag{A.14}$$


$$\tilde{\theta} = 2\bar{X}_n = \frac{2}{n}\sum_{i=1}^n X_i. \tag{A.15}$$


• Find the statistical bias, standard deviation, and mean square error of $\hat{\theta}$.
• Find the statistical bias, standard deviation, and mean square error of $\tilde{\theta}$.
• Which estimator is better?


2. For our chemist example in the introduction, can you derive the 5 steps to conduct a
two-sided hypothesis test, given the statistical significance level $\alpha = 0.05$ and the
statistical power $1 - \beta = 0.8$?


3. Run the confidence interval code with $N = 2$ and $\alpha = 0.5$ for 100 independently
generated datasets, and plot the resulting intervals (in this case t_star = 1.0). You will see
several very short intervals which are very far from containing the true mean 0. Does
this contradict the interpretation of the confidence interval? Do you feel comfortable
using short intervals to indicate high precision estimates?




A.11 Information Theory 


The universe is overflowing with information. Information provides a common language
across disciplinary rifts: from Shakespeare’s sonnets to researchers’ papers on Cornell arXiv,
from Van Gogh’s painting Starry Night to Beethoven’s Symphony No. 5, from the
first programming language Plankalkül to the state-of-the-art machine learning algorithms.
Everything must follow the rules of information theory, no matter the format. With
information theory, we can measure and compare how much information is present in different
signals. In this section, we will investigate the fundamental concepts of information theory
and applications of information theory in machine learning.


Before we get started, let’s outline the relationship between machine learning and
information theory. Machine learning aims to extract interesting signals from data and make
critical predictions. On the other hand, information theory studies encoding, decoding,
transmitting, and manipulating information. As a result, information theory provides a
fundamental language for discussing the information processing in machine learned systems.
For example, many machine learning applications use the cross-entropy loss as described in
Section 4.1. This loss can be directly derived from information theoretic considerations.


A.11.1 Information 


Let’s start with the “soul” of information theory: information. Information can be encoded 
in anything with a particular sequence of one or more encoding formats. Suppose that we 
task ourselves with trying to define a notion of information. What could be our starting 
point? 


Consider the following thought experiment. We have a friend with a deck of cards. They
will shuffle the deck, flip over some cards, and tell us statements about the cards. We will
try to assess the information content of each statement.


First, they flip over a card and tell us, “I see a card.” This provides us with no information 
at all. We were already certain that this was the case so we hope the information should be 
zero. 


Next, they flip over a card and say, “I see a heart.” This provides us some information, 
but in reality there are only 4 different suits that were possible, each equally likely, so we 
are not surprised by this outcome. We hope that whatever the measure of information, this 
event should have low information content. 


Next, they flip over a card and say, “This is the 3 of spades.” This is more information. 
Indeed there were 52 equally likely possible outcomes, and our friend told us which one it 
was. This should be a medium amount of information. 


Let’s take this to the logical extreme. Suppose that finally they flip over every card from the 
deck and read off the entire sequence of the shuffled deck. There are 52! different orders 
to the deck, again all equally likely, so we need a lot of information to know which one it 
is. 


Any notion of information we develop must conform to this intuition. Indeed, in the next 
sections we will learn how to compute that these events have 0 bits, 2 bits, 5.7 bits, and 
225.6 bits of information respectively. 


If we read through these thought experiments, we see a natural idea. As a starting point,
rather than caring about the knowledge, we may build off the idea that information
represents the degree of surprise or the abstract possibility of the event. For example, if we
want to describe an unusual event, we need a lot of information. For a common event, we may
not need much information.


In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shan- 
non, 1948) establishing the theory of information. In his article, Shannon introduced the 
concept of information entropy for the first time. We will begin our journey here. 


Self-information 


Since information embodies the abstract possibility of an event, how do we map the
possibility to the number of bits? Shannon introduced the terminology bit as the unit of
information, which was originally created by John Tukey. So what is a “bit” and why do we
use it to measure information? Historically, an antique transmitter could only send or receive
two types of code: 0 and 1. Indeed, binary encoding is still in common use on all modern
digital computers. In this way, any information is encoded by a series of 0s and 1s. Hence,
a series of binary digits of length $n$ contains $n$ bits of information.


Now, suppose that for any series of codes, each 0 or 1 occurs with a probability of $\frac{1}{2}$.
Hence, an event $X$ with a series of codes of length $n$ occurs with a probability of $\frac{1}{2^n}$. At
the same time, as we mentioned before, this series contains $n$ bits of information. So, can
we generalize to a mathematical function which can transfer the probability $p$ to the number
of bits? Shannon gave the answer by defining self-information


$$I(X) = -\log_2(p), \tag{A.1}$$


as the bits of information we have received for this event $X$. Note that we will always use
base-2 logarithms in this section. For the sake of simplicity, the rest of this section will omit
the subscript 2 in the logarithm notation, i.e., $\log(.)$ always refers to $\log_2(.)$. For example,
the code “0010” has a self-information


$$I(\text{“0010”}) = -\log(p(\text{“0010”})) = -\log\left(\frac{1}{2^4}\right) = 4 \text{ bits}. \tag{A.2}$$


We can calculate self information as shown below. Before that, let’s first import all the 
necessary packages in this section. 


import torch
from torch.nn import NLLLoss

def nansum(x):
    # Define nansum, as pytorch does not offer it inbuilt.
    return x[~torch.isnan(x)].sum()

def self_information(p):
    return -torch.log2(torch.tensor(p)).item()

self_information(1 / 64)


6.0 
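Returning to the card deck thought experiment, each statement picks out one of $N$ equally
likely outcomes, so its probability is $1/N$ and its self-information is $\log_2 N$ bits. The
sketch below, which uses Python's standard `math` module instead of PyTorch and hypothetical
variable names, recovers the four values quoted earlier.

```python
import math

# Number of equally likely outcomes behind each statement in the card example.
outcomes = {
    "I see a card": 1,
    "I see a heart": 4,
    "This is the 3 of spades": 52,
    "Entire order of the deck": math.factorial(52),
}

# Self-information of an event with probability 1/N is log2(N) bits.
bits = {statement: math.log2(count) for statement, count in outcomes.items()}
for statement, value in bits.items():
    print(f"{statement}: {value:.1f} bits")
```

Rounded to one decimal place, the four statements carry 0.0, 2.0, 5.7, and 225.6 bits.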


A.11.2 Entropy 


As self-information only measures the information of a single discrete event, we need a 
more generalized measure for any random variable of either discrete or continuous distri- 
bution. 


Motivating Entropy 


Let’s try to get specific about what we want. This will be an informal statement of what are 
known as the axioms of Shannon entropy. It will turn out that the following collection of 
common-sense statements force us to a unique definition of information. A formal version 
of these axioms, along with several others may be found in Csiszár (2008). 


1. The information we gain by observing a random variable does not depend on what we 
call the elements, or the presence of additional elements which have probability zero. 


2. The information we gain by observing two random variables is no more than the sum
of the information we gain by observing them separately. If they are independent, then
it is exactly the sum.

3. The information gained when observing (nearly) certain events is (nearly) zero.


While proving this fact is beyond the scope of our text, it is important to know that this 
uniquely determines the form that entropy must take. The only ambiguity that these allow 
is in the choice of fundamental units, which is most often normalized by making the choice 
we saw before that the information provided by a single fair coin flip is one bit. 


Definition 


For any random variable $X$ that follows a probability distribution $P$ with a probability
density function (p.d.f.) or a probability mass function (p.m.f.) $p(x)$, we measure the expected
amount of information through entropy (or Shannon entropy)


$$H(X) = -E_{x \sim P}[\log p(x)]. \tag{A.3}$$


To be specific, if $X$ is discrete,


$$H(X) = -\sum_i p_i \log p_i, \text{ where } p_i = P(X_i). \tag{A.4}$$


Otherwise, if $X$ is continuous, we also refer to entropy as differential entropy


$$H(X) = -\int_x p(x) \log p(x) \, dx. \tag{A.5}$$


We can define entropy as below.


def entropy(p):
    entropy = - p * torch.log2(p)
    # Operator `nansum` will sum up the non-nan number
    out = nansum(entropy)
    return out

entropy(torch.tensor([0.1, 0.5, 0.1, 0.3]))


tensor(1.6855)


Interpretations 


You may be curious: in the entropy definition (A.3), why do we use an expectation of a
negative logarithm? Here are some intuitions.


First, why do we use a logarithm function $\log$? Suppose that $p(x) = f_1(x) f_2(x) \cdots f_n(x)$,
where each component function $f_i(x)$ is independent of the others. This means that each
$f_i(x)$ contributes independently to the total information obtained from $p(x)$. As discussed
above, we want the entropy formula to be additive over independent random variables.
Luckily, $\log$ can naturally turn a product of probability distributions into a summation of the
individual terms.


Next, why do we use a negative $\log$? Intuitively, more frequent events should contain less
information than less common events, since we often gain more information from an
unusual case than from an ordinary one. However, $\log$ is monotonically increasing with the
probabilities, and indeed non-positive for all values in $(0, 1]$. We need to construct a
monotonically decreasing relationship between the probability of events and their entropy, which
will ideally be always positive (for nothing we observe should force us to forget what we
have known). Hence, we add a negative sign in front of the $\log$ function.


Last, where does the expectation function come from? Consider a random variable $X$. We
can interpret the self-information ($-\log(p)$) as the amount of surprise we have at seeing
a particular outcome. Indeed, as the probability approaches zero, the surprise becomes
infinite. Similarly, we can interpret the entropy as the average amount of surprise from
observing $X$. For example, imagine that a slot machine system emits statistically independent
symbols $s_1, \ldots, s_k$ with probabilities $p_1, \ldots, p_k$ respectively. Then the entropy of
this system equals the average self-information from observing each output, i.e.,


$$H(S) = \sum_i p_i \cdot I(s_i) = -\sum_i p_i \cdot \log p_i. \tag{A.6}$$


Properties of Entropy


By the above examples and interpretations, we can derive the following properties of en- 
tropy (A.3). Here, we refer to X as an event and P as the probability distribution of 
X. 


• $H(X) \geq 0$ for all discrete $X$ (entropy can be negative for continuous $X$).

• If $X \sim P$ with a p.d.f. or a p.m.f. $p(x)$, and we try to estimate $P$ by a new probability
distribution $Q$ with a p.d.f. or a p.m.f. $q(x)$, then

$$H(X) = -E_{x \sim P}[\log p(x)] \leq -E_{x \sim P}[\log q(x)], \text{ with equality if and only if } P = Q. \tag{A.7}$$

Alternatively, $H(X)$ gives a lower bound of the average number of bits needed to
encode symbols drawn from $P$.

• If $X \sim P$, then $x$ conveys the maximum amount of information if it spreads evenly among
all possible outcomes. Specifically, if the probability distribution $P$ is discrete with
$k$-class $\{p_1, \ldots, p_k\}$, then

$$H(X) \leq \log(k), \text{ with equality if and only if } p_i = \frac{1}{k}, \forall i. \tag{A.8}$$

If $P$ is a continuous random variable, then the story becomes much more complicated.
However, if we additionally impose that $P$ is supported on a finite interval (with all
values between 0 and 1), then $P$ has the highest entropy if it is the uniform distribution
on that interval.
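The second and third properties can be checked numerically on small discrete distributions.
In this sketch the distributions `p` and `q` are arbitrary examples, and `entropy` and
`cross_entropy` are plain-Python helpers written for this illustration (they are not the
PyTorch functions defined above).

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average bits needed when symbols from p are encoded with a code built for q."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.1, 0.5, 0.1, 0.3]
q = [0.25, 0.25, 0.25, 0.25]  # uniform distribution over k = 4 classes

print(entropy(p), cross_entropy(p, q))  # entropy of p is below the cross-entropy, as in (A.7)
print(entropy(q), log2(len(q)))         # the uniform distribution attains log(k), as in (A.8)
```

Encoding symbols from `p` with a code designed for `q` costs 2 bits on average, more than
the entropy of `p`, while the uniform distribution reaches the maximum of $\log(4) = 2$ bits.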


A.11.3 Mutual Information 


Previously we defined the entropy of a single random variable $X$. How about the entropy of a
pair of random variables $(X, Y)$? We can think of these techniques as trying to answer the
following type of question: “What information is contained in $X$ and $Y$ together compared
to each separately? Is there redundant information, or is it all unique?”


For the following discussion, we always use $(X, Y)$ as a pair of random variables that follows
a joint probability distribution $P$ with a p.d.f. or a p.m.f. $p_{X, Y}(x, y)$, while $X$ and $Y$ follow
probability distributions $p_X(x)$ and $p_Y(y)$, respectively.


Joint Entropy


Similar to the entropy of a single random variable (A.3), we define the joint entropy $H(X, Y)$
of a pair of random variables $(X, Y)$ as


$$H(X, Y) = -E_{(x, y) \sim P}[\log p_{X, Y}(x, y)]. \tag{A.9}$$


Precisely, on the one hand, if $(X, Y)$ is a pair of discrete random variables, then


$$H(X, Y) = -\sum_{x} \sum_{y} p_{X, Y}(x, y) \log p_{X, Y}(x, y). \tag{A.10}$$


On the other hand, if $(X, Y)$ is a pair of continuous random variables, then we define the
differential joint entropy as


$$H(X, Y) = -\int_{x, y} p_{X, Y}(x, y) \log p_{X, Y}(x, y) \, dx \, dy. \tag{A.11}$$


We can think of (A.9) as telling us the total randomness in the pair of random variables. 
As a pair of extremes, if X = Y are two identical random variables, then the information in 
the pair is exactly the information in one and we have H(X,Y) = H(X) = H(Y). On the 
other extreme, if X and Y are independent then H(X,Y) = H(X) + H(Y). Indeed we will 
always have that the information contained in a pair of random variables is no smaller than 
the entropy of either random variable and no more than the sum of both. 


$$H(X), H(Y) \leq H(X, Y) \leq H(X) + H(Y). \tag{A.12}$$

Let’s implement joint entropy from scratch. 


def joint_entropy(p_xy):
    joint_ent = -p_xy * torch.log2(p_xy)
    # Operator `nansum` will sum up the non-nan number
    out = nansum(joint_ent)
    return out


joint_entropy(torch.tensor([[0.1, 0.5], [0.1, 0.3]]))


tensor(1.6855)


Notice that this is the same code as before, but now we interpret it differently as working
on the joint distribution of the two random variables.


Conditional Entropy


The joint entropy defined above measures the amount of information contained in a pair of
random variables. This is useful, but oftentimes it is not what we care about. Consider the
setting of machine learning. Let’s take X to be the random variable (or vector of random
variables) that describes the pixel values of an image, and Y to be the random variable which
is the class label. X should contain substantial information, since a natural image is a
complex thing. However, the information contained in Y once the image has been shown should
be low. Indeed, the image of a digit should already contain the information about what digit
it is unless the digit is illegible. Thus, to continue to extend our vocabulary of information
theory, we need to be able to reason about the information content in a random variable
conditional on another.


In probability theory, we saw the definition of conditional probability to measure
the relationship between variables. We now want to analogously define the conditional
entropy H(Y | X). We can write this as

$$H(Y \mid X) = -E_{(x, y) \sim P}[\log p(y \mid x)], \tag{A.13}$$

where $p(y \mid x) = \frac{p_{X, Y}(x, y)}{p_X(x)}$ is the conditional probability. Specifically, if (X, Y) is a pair of
discrete random variables, then

$$H(Y \mid X) = -\sum_{x} \sum_{y} p(x, y) \log p(y \mid x). \tag{A.14}$$

If (X, Y) is a pair of continuous random variables, then the differential conditional entropy
is similarly defined as

$$H(Y \mid X) = -\int_x \int_y p(x, y) \log p(y \mid x) \; dx \; dy. \tag{A.15}$$


It is now natural to ask, how does the conditional entropy H(Y | X) relate to the entropy 
H(X) and the joint entropy H(X,Y)? Using the definitions above, we can express this 
cleanly: 


H(Y | X) = H(X,Y) - H(X). (A.16) 


This has an intuitive interpretation: the information in Y given X (H(Y | X)) is the same 
as the information in both X and Y together (H(X, Y)) minus the information already con- 
tained in X. This gives us the information in Y which is not also represented in X. 
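
As a quick sanity check of (A.12) and (A.16), the sketch below computes every quantity directly from its definition for one small joint distribution. It uses plain Python rather than PyTorch; the helper `H` and the joint table `p_xy` are hypothetical values chosen for illustration.

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities (zeros are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint p.m.f. with p_xy[i][j] = P(X = i, Y = j); entries sum to 1.
p_xy = [[0.1, 0.4],
        [0.2, 0.3]]

p_x = [sum(row) for row in p_xy]          # marginal of X (row sums)
p_y = [sum(col) for col in zip(*p_xy)]    # marginal of Y (column sums)

h_xy = H(p for row in p_xy for p in row)  # joint entropy, as in (A.10)
h_x, h_y = H(p_x), H(p_y)

# Conditional entropy from (A.14), using p(y | x) = p(x, y) / p(x).
h_y_given_x = -sum(
    p_xy[i][j] * math.log2(p_xy[i][j] / p_x[i])
    for i in range(2) for j in range(2) if p_xy[i][j] > 0
)
print(h_xy, h_x, h_y, h_y_given_x)
```

Up to floating-point error, `h_y_given_x` agrees with `h_xy - h_x`, and `h_xy` lies between the larger marginal entropy and the sum of both.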


Now, let’s implement conditional entropy (A.13) from scratch. 


def conditional_entropy(p_xy, p_x):
    p_y_given_x = p_xy / p_x
    cond_ent = -p_xy * torch.log2(p_y_given_x)
    # Operator `nansum` will sum up the non-nan number
    out = nansum(cond_ent)
    return out


conditional_entropy(torch.tensor([[0.1, 0.5], [0.2, 0.3]]),
                    torch.tensor([0.2, 0.8]))


tensor(0.8635)


Mutual Information 


Given the previous setting of random variables (X,Y), you may wonder: “Now that we 
know how much information is contained in Y but not in X, can we similarly ask how much 
information is shared between X and Y?” The answer will be the mutual information of 
(X, Y), which we will write as I(X, Y).


Rather than diving straight into the formal definition, let’s practice our intuition by first 
trying to derive an expression for the mutual information entirely based on terms we have 
constructed before. We wish to find the information shared between two random variables. 
One way we could try to do this is to start with all the information contained in both X and 
Y together, and then we take off the parts that are not shared. The information contained in 
both X and Y together is written as H(X, Y). We want to subtract from this the information 
contained in X but not in Y, and the information contained in Y but not in X. As we saw in 
the previous section, this is given by H(X | Y) and H(Y | X) respectively. Thus, we have 
that the mutual information should be 


$$I(X, Y) = H(X, Y) - H(Y \mid X) - H(X \mid Y). \tag{A.17}$$


Indeed, this is a valid definition for the mutual information. If we expand out the definitions 
of these terms and combine them, a little algebra shows that this is the same as 


$$I(X, Y) = E_{x} E_{y} \left\{ p_{X, Y}(x, y) \log\frac{p_{X, Y}(x, y)}{p_X(x) p_Y(y)} \right\}. \tag{A.18}$$

We can summarize all of these relationships in Fig. A.1. It is an excellent test of
intuition to see why the following statements are all also equivalent to I(X, Y).

• H(X) - H(X | Y)
• H(Y) - H(Y | X)
• H(X) + H(Y) - H(X, Y)


Fig. A.1 Mutual information’s relationship with joint entropy and conditional entropy.


In many ways we can think of the mutual information (A.18) as a principled extension of
the correlation coefficient we saw in Section A.6. This allows us to ask not only for linear
relationships between variables, but for the maximum information shared between the two
random variables of any kind.


Now, let’s implement mutual information from scratch. 


def mutual_information(p_xy, p_x, p_y):
    p = p_xy / (p_x * p_y)
    mutual = p_xy * torch.log2(p)
    # Operator `nansum` will sum up the non-nan number
    out = nansum(mutual)
    return out


mutual_information(torch.tensor([[0.1, 0.5], [0.1, 0.3]]),
                   torch.tensor([0.2, 0.8]), torch.tensor([[0.75, 0.25]]))


tensor(0.7195)


Properties of Mutual Information 


Rather than memorizing the definition of mutual information (A.18), you only need to keep 
in mind its notable properties: 


• Mutual information is symmetric, i.e., I(X, Y) = I(Y, X).

• Mutual information is non-negative, i.e., $I(X, Y) \geq 0$.

• I(X, Y) = 0 if and only if X and Y are independent. For example, if X and Y are
independent, then knowing Y does not give any information about X and vice versa, so
their mutual information is zero.

• Alternatively, if X is an invertible function of Y, then Y and X share all information and

$$I(X, Y) = H(Y) = H(X). \tag{A.19}$$
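
These properties are easy to probe on small examples. The sketch below is a plain-Python illustration (not the PyTorch function above): it derives the marginals from the joint table itself, so the inputs are always consistent. The three joint tables are hypothetical and chosen to hit the dependent, independent, and identical cases.

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities (zeros are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information_bits(p_xy):
    """I(X, Y) in bits from a joint p.m.f. given as a nested list, following (A.18)."""
    p_x = [sum(row) for row in p_xy]
    p_y = [sum(col) for col in zip(*p_xy)]
    return sum(
        p * math.log2(p / (p_x[i] * p_y[j]))
        for i, row in enumerate(p_xy) for j, p in enumerate(row) if p > 0
    )

dependent = [[0.1, 0.4], [0.2, 0.3]]     # hypothetical joint, sums to 1
independent = [[0.5 * 0.3, 0.5 * 0.7],   # outer product of two marginals
               [0.5 * 0.3, 0.5 * 0.7]]
identical = [[0.25, 0.0], [0.0, 0.75]]   # X = Y with probability 1

transposed = [list(col) for col in zip(*dependent)]  # swaps the roles of X and Y
i_dep = mutual_information_bits(dependent)
i_sym = mutual_information_bits(transposed)
i_ind = mutual_information_bits(independent)
i_same = mutual_information_bits(identical)
print(i_dep, i_sym, i_ind, i_same)
```

Here `i_dep` equals `i_sym` (symmetry), `i_ind` vanishes up to rounding (independence), and `i_same` matches the entropy of the shared variable, as in (A.19).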


Pointwise Mutual Information 


When we worked with entropy at the beginning of this chapter, we were able to provide an
interpretation of $-\log(p_X(x))$ as how surprised we were with the particular outcome. We
may give a similar interpretation to the logarithmic term in the mutual information, which
is often referred to as the pointwise mutual information:

$$\mathrm{pmi}(x, y) = \log\frac{p_{X, Y}(x, y)}{p_X(x) p_Y(y)}. \tag{A.20}$$

We can think of (A.20) as measuring how much more or less likely the specific combination
of outcomes x and y is compared to what we would expect for independent random
outcomes. If it is large and positive, then these two specific outcomes occur much more
frequently than they would by random chance (note: the denominator is $p_X(x) p_Y(y)$,
which is the probability of the two outcomes if they were independent), whereas if it is large
and negative it represents the two outcomes happening far less than we would expect by random
chance.


This allows us to interpret the mutual information (A.18) as the average amount that we 
were surprised to see two outcomes occurring together compared to what we would expect 
if they were independent. 
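
To make this interpretation concrete, the following sketch evaluates the pointwise mutual information for every outcome pair of a small joint distribution and then averages the scores. It is plain Python under stated assumptions: the function name `pmi` and the joint table are hypothetical.

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information in bits, following (A.20)."""
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical joint p.m.f. over two binary outcomes that tend to agree.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
p_y = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

scores = {xy: pmi(p, p_x[xy[0]], p_y[xy[1]]) for xy, p in joint.items()}
# Averaging the pointwise scores under the joint distribution recovers (A.18).
mi = sum(joint[xy] * scores[xy] for xy in joint)
print(scores, mi)
```

Agreeing pairs get a positive score, disagreeing pairs a negative one, and the weighted average is the (non-negative) mutual information.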


Applications of Mutual Information 


Mutual information may be a little abstract in its pure definition, so how is it related to
machine learning? In natural language processing, one of the most difficult problems is
ambiguity resolution, or the issue of the meaning of a word being unclear from context.
For example, recently a headline in the news reported that “Amazon is on fire”. You may
wonder whether the company Amazon has a building on fire, or the Amazon rain forest is
on fire.


In this case, mutual information can help us resolve this ambiguity. We first find the group
of words that each have a relatively large mutual information with the company Amazon,
such as e-commerce, technology, and online. Second, we find another group of words that
each have a relatively large mutual information with the Amazon rain forest, such as rain,
forest, and tropical. When we need to disambiguate “Amazon”, we can compare which
group occurs more often in the context of the word Amazon. In this case the article
would go on to describe the forest, and make the context clear.
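
The comparison step can be sketched in a few lines. This is only a toy illustration: the two word groups are the ones named above and are hard-coded here, whereas in practice they would be selected by ranking candidate words by their mutual information with each sense. The function name `disambiguate` and the example sentence are hypothetical.

```python
# Hypothetical word groups, one per sense of "Amazon".
sense_words = {
    "company": {"e-commerce", "technology", "online"},
    "rainforest": {"rain", "forest", "tropical"},
}

def disambiguate(context_tokens):
    """Return the sense whose word group occurs most often in the context."""
    counts = {
        sense: sum(token in words for token in context_tokens)
        for sense, words in sense_words.items()
    }
    return max(counts, key=counts.get), counts

tokens = "fires spread through the tropical forest after a dry season with little rain".split()
sense, counts = disambiguate(tokens)
print(sense, counts)
```

A tie between the two groups would need a fallback rule, which this sketch does not attempt.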


A.11.4 Kullback-Leibler Divergence


As we discussed in Section 2.3, we can use norms to measure the distance between
two points in a space of any dimensionality. We would like to be able to do a similar task
with probability distributions. There are many ways to go about this, but information theory
provides one of the nicest. We now explore the Kullback-Leibler (KL) divergence, which
provides a way to measure if two distributions are close together or not.


Definition 


Given a random variable X that follows the probability distribution P with a p.d.f. or a
p.m.f. p(x), suppose we estimate P by another probability distribution Q with a p.d.f. or a
p.m.f. q(x). Then the Kullback-Leibler (KL) divergence (or relative entropy) between P
and Q is

$$D_{\mathrm{KL}}(P \| Q) = E_{x \sim P}\left[\log \frac{p(x)}{q(x)}\right]. \tag{A.21}$$


As with the pointwise mutual information (A.20), we can again provide an interpretation of
the logarithmic term: $-\log \frac{q(x)}{p(x)} = -\log(q(x)) - (-\log(p(x)))$ will be large and positive
if we see x far more often under P than we would expect for Q, and large and negative if
we see the outcome far less than expected. In this way, we can interpret it as our relative
surprise at observing the outcome compared to how surprised we would be observing it
from our reference distribution.


Let’s implement the KL divergence from scratch.


def kl_divergence(p, q):
    kl = p * torch.log2(p / q)
    out = nansum(kl)
    return out.abs().item()


KL Divergence Properties 


Let’s take a look at some properties of the KL divergence (A.21). 


• KL divergence is non-symmetric, i.e., there are P, Q such that

$$D_{\mathrm{KL}}(P \| Q) \neq D_{\mathrm{KL}}(Q \| P). \tag{A.22}$$

• KL divergence is non-negative, i.e.,

$$D_{\mathrm{KL}}(P \| Q) \geq 0. \tag{A.23}$$

Note that the equality holds only when P = Q.

• If there exists an x such that p(x) > 0 and q(x) = 0, then $D_{\mathrm{KL}}(P \| Q) = \infty$.


• There is a close relationship between KL divergence and mutual information. Besides
the relationship shown in Fig. A.1, I(X, Y) is also numerically equivalent to the
following terms:

1. $D_{\mathrm{KL}}(P(X, Y) \,\|\, P(X)P(Y))$;
2. $E_Y \{ D_{\mathrm{KL}}(P(X \mid Y) \,\|\, P(X)) \}$;
3. $E_X \{ D_{\mathrm{KL}}(P(Y \mid X) \,\|\, P(Y)) \}$.

For the first term, we interpret mutual information as the KL divergence between
P(X, Y) and the product of P(X) and P(Y); it is thus a measure of how different
the joint distribution is from the distribution the pair would have if X and Y were
independent. For the second term, mutual information tells us the average reduction in
uncertainty about X that results from learning the value of Y. The third term reads the
same way with the roles of X and Y swapped.
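
The properties listed above can be verified on discrete examples. The sketch below is a plain-Python illustration, separate from the `kl_divergence` function defined earlier: the name `kl_divergence_bits` and all distributions are hypothetical, and unlike the PyTorch version it returns infinity explicitly when q(x) = 0 while p(x) > 0.

```python
import math

def kl_divergence_bits(p, q):
    """D_KL(P || Q) in bits for discrete distributions given as lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # the term 0 * log(0 / q) is taken to be 0
        if qi == 0:
            return math.inf   # p(x) > 0 while q(x) = 0
        total += pi * math.log2(pi / qi)
    return total

p = [0.5, 0.4, 0.1]   # hypothetical distributions over three outcomes
q = [0.8, 0.1, 0.1]

d_pq = kl_divergence_bits(p, q)
d_qp = kl_divergence_bits(q, p)
d_pp = kl_divergence_bits(p, p)
d_inf = kl_divergence_bits(p, [0.5, 0.5, 0.0])

# Mutual information as the KL divergence between a joint distribution
# and the product of its marginals (first term in the list above).
joint = [0.1, 0.4, 0.2, 0.3]   # flattened 2 x 2 joint p.m.f.
product = [0.5 * 0.3, 0.5 * 0.7, 0.5 * 0.3, 0.5 * 0.7]
mi = kl_divergence_bits(joint, product)
print(d_pq, d_qp, d_pp, d_inf, mi)
```

The two directions `d_pq` and `d_qp` differ, both are positive, and the divergence of a distribution from itself is zero.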


Example 


Let’s go through a toy example to see the non-symmetry explicitly. 


First, let’s generate and sort three tensors of length 10,000: an objective tensor p which
follows a normal distribution N(0, 1), and two candidate tensors q1 and q2 which follow
normal distributions N(-1, 1) and N(1, 1) respectively.


torch.manual_seed(1)

tensor_len = 10000
p = torch.normal(0, 1, (tensor_len, ))
q1 = torch.normal(-1, 1, (tensor_len, ))
q2 = torch.normal(1, 1, (tensor_len, ))

p = torch.sort(p)[0]
q1 = torch.sort(q1)[0]
q2 = torch.sort(q2)[0]


Since q1 and q2 are symmetric with respect to the y-axis (i.e., x = 0), we expect similar
values for the KL divergences $D_{\mathrm{KL}}(p \| q_1)$ and $D_{\mathrm{KL}}(p \| q_2)$. As you can see below, they
differ by less than 3%.


kl_pq1 = kl_divergence(p, q1)
kl_pq2 = kl_divergence(p, q2)
similar_percentage = abs(kl_pq1 - kl_pq2) / ((kl_pq1 + kl_pq2) / 2) * 100

kl_pq1, kl_pq2, similar_percentage


(8582.0341796875, 8828.3095703125, 2.8290698237936858) 


In contrast, you may find that $D_{\mathrm{KL}}(q_2 \| p)$ and $D_{\mathrm{KL}}(p \| q_2)$ differ considerably, by more
than 40% as shown below.


kl_q2p = kl_divergence(q2, p) 
differ_percentage = abs(kl_q2p - kl_pq2) / ((kl_q2p + kl_pq2) / 2) * 100 


kl_q2p, differ_percentage 


(14130.125, 46.18621024399691) 


A.11.5 Cross-Entropy 


If you are curious about applications of information theory in deep learning, here is a quick
example. We define the true distribution P with probability distribution p(x), and the
estimated distribution Q with probability distribution q(x), and we will use them in the
rest of this section.


Say we need to solve a binary classification problem based on n given data examples
$\{x_1, \ldots, x_n\}$. Assume that we encode 1 and 0 as the positive and negative class label $y_i$
respectively, and our neural network is parametrized by $\theta$. If we aim to find the best $\theta$ so
that $\hat{y}_i = p_{\theta}(y_i \mid x_i)$, it is natural to apply the maximum log-likelihood approach as was
seen in Section A.7. To be specific, for true labels $y_i$ and predictions $\hat{y}_i = p_{\theta}(y_i \mid x_i)$, the
probability to be classified as positive is $\pi_i = p_{\theta}(y_i = 1 \mid x_i)$. Hence, the log-likelihood
function would be

$$\begin{aligned} l(\theta) &= \log L(\theta) \\ &= \log \prod_{i=1}^n \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} \\ &= \sum_{i=1}^n y_i \log(\pi_i) + (1 - y_i) \log(1 - \pi_i). \end{aligned} \tag{A.24}$$


Maximizing the log-likelihood function $l(\theta)$ is identical to minimizing $-l(\theta)$, and hence
we can find the best $\theta$ from here. To generalize the above loss to any distributions, we also
call $-l(\theta)$ the cross-entropy loss $\mathrm{CE}(y, \hat{y})$, where y follows the true distribution P and $\hat{y}$
follows the estimated distribution Q.


This was all derived by working from the maximum likelihood point of view. However, if
we look closely we can see that terms like $\log(\pi_i)$ have entered into our computation, which
is a solid indication that we can understand the expression from an information theoretic
point of view.
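
Before moving to the information-theoretic view, it may help to evaluate (A.24) on concrete numbers. The sketch below is plain Python; the function name `binary_log_likelihood`, the labels, and the two sets of predicted probabilities are hypothetical.

```python
import math

def binary_log_likelihood(y, pi):
    """l(theta) from (A.24): sum_i y_i log(pi_i) + (1 - y_i) log(1 - pi_i)."""
    return sum(
        yi * math.log(p) + (1 - yi) * math.log(1 - p)
        for yi, p in zip(y, pi)
    )

# Hypothetical labels and predicted positive-class probabilities.
y = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.7]   # mostly agrees with the labels
hesitant = [0.6, 0.4, 0.5, 0.5]    # closer to a coin flip

loss_confident = -binary_log_likelihood(y, confident)
loss_hesitant = -binary_log_likelihood(y, hesitant)
print(loss_confident, loss_hesitant)
```

Predictions that put more probability on the true labels give a higher log-likelihood and therefore a lower negated value, which is the quantity minimized as a loss.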


Formal Definition 


Like KL divergence, for a random variable X, we can also measure the divergence between 
the estimating distribution Q and the true distribution P via cross-entropy, 


$$\mathrm{CE}(P, Q) = -E_{x \sim P}[\log(q(x))]. \tag{A.25}$$


By using properties of entropy discussed above, we can also interpret it as the summation 
of the entropy H(P) and the KL divergence between P and Q, i.e., 


$$\mathrm{CE}(P, Q) = H(P) + D_{\mathrm{KL}}(P \| Q). \tag{A.26}$$

We can implement the cross-entropy loss as below. 


def cross_entropy(y_hat, y):
    ce = -torch.log(y_hat[range(len(y_hat)), y])
    return ce.mean()


Now define two tensors for the labels and predictions, and calculate the cross-entropy loss 
of them. 


labels = torch.tensor([0, 2])
preds = torch.tensor([[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]])

cross_entropy(preds, labels)


tensor(0.9486)


Properties


As alluded to at the beginning of this section, cross-entropy (A.25) can be used to define a
loss function in the optimization problem. It turns out that the following are equivalent:


1. Maximizing the predictive probability of Q for distribution P (i.e., $E_{x \sim P}[\log(q(x))]$);
2. Minimizing the cross-entropy CE(P, Q);
3. Minimizing the KL divergence $D_{\mathrm{KL}}(P \| Q)$.


The definition of cross-entropy indirectly proves the equivalence between objective 2 and
objective 3, as long as the entropy of the true data H(P) is constant.
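
The decomposition (A.26) and the resulting equivalence can be illustrated numerically. In the sketch below (plain Python, natural logarithms), the true distribution `p` and the three candidate estimates are hypothetical; the point is only that cross-entropy and KL divergence differ by the constant H(P) and therefore rank the candidates identically.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.4, 0.1]            # hypothetical true distribution P
candidates = {                 # hypothetical estimates Q
    "far": [0.1, 0.1, 0.8],
    "closer": [0.4, 0.4, 0.2],
    "exact": [0.5, 0.4, 0.1],
}

for name, q in candidates.items():
    ce, kl = cross_entropy(p, q), kl_divergence(p, q)
    # By (A.26), the last column is always H(P).
    print(name, round(ce, 4), round(kl, 4), round(ce - kl, 4))

best_by_ce = min(candidates, key=lambda n: cross_entropy(p, candidates[n]))
best_by_kl = min(candidates, key=lambda n: kl_divergence(p, candidates[n]))
```

Both criteria select the same candidate, and the minimum of the KL divergence is zero when Q equals P.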


Cross-Entropy as An Objective Function of Multi-class Classification 


If we dive deep into the classification objective function with cross-entropy loss CE, we 
will find minimizing CE is equivalent to maximizing the log-likelihood function L. 


To begin with, suppose that we are given a dataset with n examples, and it can be classified
into k classes. For each data example i, we represent any k-class label $\mathbf{y}_i = (y_{i1}, \ldots, y_{ik})$
by one-hot encoding. To be specific, if the example i belongs to class j, then we set the
j-th entry to 1, and all other components to 0, i.e.,

$$y_{ij} = \begin{cases} 1 & j \in J; \\ 0 & \text{otherwise.} \end{cases} \tag{A.27}$$

For instance, if a multi-class classification problem contains three classes A, B, and C, then
the labels $\mathbf{y}_i$ can be encoded in {A : (1, 0, 0); B : (0, 1, 0); C : (0, 0, 1)}.


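Assume that our neural network is parametrized by $\theta$. For true label vectors $\mathbf{y}_i$ and
predictions

$$\hat{\mathbf{y}}_i = p_{\theta}(\mathbf{y}_i \mid \mathbf{x}_i) = \sum_{j=1}^k y_{ij} p_{\theta}(y_{ij} \mid \mathbf{x}_i). \tag{A.28}$$

Hence, the cross-entropy loss would be

$$\mathrm{CE}(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{i=1}^n \mathbf{y}_i \log \hat{\mathbf{y}}_i = -\sum_{i=1}^n \sum_{j=1}^k y_{ij} \log p_{\theta}(y_{ij} \mid \mathbf{x}_i). \tag{A.29}$$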


On the other side, we can also approach the problem through maximum likelihood
estimation. To begin with, let’s quickly introduce a k-class multinoulli distribution. It is
an extension of the Bernoulli distribution from binary class to multi-class. If a random
variable $\mathbf{z} = (z_1, \ldots, z_k)$ follows a k-class multinoulli distribution with probabilities
$\mathbf{p} = (p_1, \ldots, p_k)$, i.e.,

$$p(\mathbf{z}) = p(z_1, \ldots, z_k) = \mathrm{Multi}(p_1, \ldots, p_k), \text{ where } \sum_{i=1}^k p_i = 1, \tag{A.30}$$




then the joint probability mass function (p.m.f.) of $\mathbf{z}$ is

$$\mathbf{p}^{\mathbf{z}} = \prod_{j=1}^k p_j^{z_j}. \tag{A.31}$$


It can be seen that the label of each data example, $\mathbf{y}_i$, follows a k-class multinoulli
distribution with probabilities $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_k)$. Therefore, the joint p.m.f. of each data
example $\mathbf{y}_i$ is $\boldsymbol{\pi}^{\mathbf{y}_i} = \prod_{j=1}^k \pi_j^{y_{ij}}$. Hence, the log-likelihood function would be

$$l(\theta) = \log L(\theta) = \log \prod_{i=1}^n \boldsymbol{\pi}^{\mathbf{y}_i} = \log \prod_{i=1}^n \prod_{j=1}^k \pi_j^{y_{ij}} = \sum_{i=1}^n \sum_{j=1}^k y_{ij} \log \pi_j. \tag{A.32}$$

In maximum likelihood estimation, we maximize the objective function $l(\theta)$ by
setting $\pi_j = p_{\theta}(y_{ij} \mid \mathbf{x}_i)$. Therefore, for any multi-class classification, maximizing the
above log-likelihood function $l(\theta)$ is equivalent to minimizing the CE loss $\mathrm{CE}(y, \hat{y})$.


To test the above proof, let’s apply the built-in negative log-likelihood measure NLLLoss.
Using the same labels and preds as in the earlier example, we will get the same numerical
loss as the previous example up to the fifth decimal place.


from torch.nn import NLLLoss

# Implementation of cross-entropy loss in PyTorch combines `nn.LogSoftmax()`
# and `nn.NLLLoss()`
nll_loss = NLLLoss()
loss = nll_loss(torch.log(preds), labels)
loss


tensor(0.9486)
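
The same number can also be reached from the one-hot formulation in (A.27) and (A.29). The sketch below is plain Python and copies the values of the `labels` and `preds` tensors above into lists. Note one assumption made explicit here: (A.29) is written as a sum over examples, whereas the code above averages, so the sum is divided by the number of examples at the end.

```python
import math

# Same values as the `labels` and `preds` tensors above, written as lists.
labels = [0, 2]
preds = [[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]]
k = 3

# One-hot encode each label as in (A.27).
one_hot = [[1 if j == label else 0 for j in range(k)] for label in labels]

# Double sum from (A.29); terms with y_ij = 0 vanish.
ce_sum = -sum(
    one_hot[i][j] * math.log(preds[i][j])
    for i in range(len(labels)) for j in range(k)
)
ce_mean = ce_sum / len(labels)   # averaging over examples, as `ce.mean()` does
print(ce_sum, ce_mean)
```

Only the predicted probability of the true class contributes to each example's term, which is why indexing with the integer label and summing against a one-hot vector give the same loss.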


A.11.6 Summary 


• Information theory is a field of study about encoding, decoding, transmitting, and
manipulating information.

• Entropy measures how much information is present in different signals.

• KL divergence can also measure the divergence between two distributions.

• Cross-entropy can be viewed as an objective function of multi-class classification.
Minimizing cross-entropy loss is equivalent to maximizing the log-likelihood function.


A.11.7 Exercises 


1. Verify that the card examples from the first section indeed have the claimed entropy. 


2. Show that the KL divergence $D(p \| q)$ is nonnegative for all distributions p and q. Hint:
use Jensen’s inequality, i.e., use the fact that $-\log x$ is a convex function.


3. Let’s compute the entropy from a few data sources:

• Assume that you are watching the output generated by a monkey at a typewriter. The
monkey presses any of the 44 keys of the typewriter at random (you can assume
that it has not discovered any special keys or the shift key yet). How many bits of
randomness per character do you observe?

• Being unhappy with the monkey, you replace it with a drunk typesetter. It is able
to generate words, albeit not coherently. Instead, it picks a random word out of
a vocabulary of 2,000 words. Let’s assume that the average length of a word is
4.5 letters in English. How many bits of randomness per character do you observe
now?

• Still being unhappy with the result, you replace the typesetter with a high-quality
language model. The language model can currently obtain a perplexity as low as 15
points per word. The character perplexity of a language model is defined as the
inverse of the geometric mean of a set of probabilities, where each probability
corresponds to a character in the word. To be specific, if the length of a given word is
L, then $\mathrm{PPL}(\text{word}) = \left[\prod_i p(\text{character}_i)\right]^{-\frac{1}{L}} = \exp\left[-\frac{1}{L} \sum_i \log p(\text{character}_i)\right]$.
Assume that the test word has 4.5 letters. How many bits of randomness per character
do you observe now?


4. Explain intuitively why I(X, Y) = H(X) - H(X | Y). Then, show this is true by
expressing both sides as an expectation with respect to the joint distribution.


5. What is the KL divergence between the two Gaussian distributions $\mathcal{N}(\mu_1, \sigma_1^2)$ and
$\mathcal{N}(\mu_2, \sigma_2^2)$?


Discussions (see footnote 290).


To get the most out of Dive into Deep Learning, we will talk you through different tools in
this appendix, such as for running and contributing to this interactive open-source book.


B.1 Using Jupyter Notebooks 


This section describes how to edit and run the code in each section of this book using
the Jupyter Notebook. Make sure you have installed Jupyter and downloaded the code as
described in Installation (page xxxiv). If you want to know more about Jupyter, see the
excellent tutorial in their documentation.


B.1.1 Editing and Running the Code Locally 


Suppose that the local path of the book’s code is xx/yy/d2l-en/. Use the shell to change
the directory to this path (cd xx/yy/d2l-en) and run the command jupyter notebook.
If your browser does not open automatically, open http://localhost:8888 and you will see
the interface of Jupyter and all the folders containing the code of the book, as shown in Fig.
B.1.


Fig. B.1 The folders containing the code of this book.


You can access the notebook files by clicking on the folder displayed on the webpage. They
usually have the suffix “.ipynb”. For the sake of brevity, we create a temporary “test.ipynb”
file. The content displayed after you click it is shown in Fig. B.2. This notebook includes a
markdown cell and a code cell. The content in the markdown cell includes “This Is a Title”
and “This is text.”. The code cell contains two lines of Python code.


Fig. B.2 Markdown and code cells in the “test.ipynb” file.


Double click on the markdown cell to enter edit mode. Add a new text string “Hello world.” 
at the end of the cell, as shown in Fig. B.3. 


Fig. B.3 Edit the markdown cell.


As demonstrated in Fig. B.4, click “Cell” → “Run Cells” in the menu bar to run the edited
cell.


After running, the markdown cell is shown in Fig. B.5. 


Next, click on the code cell. Multiply the elements by 2 after the last line of code, as shown 
in Fig. B.6. 


You can also run the cell with a shortcut (“Ctrl + Enter” by default) and obtain the output 
result from Fig. B.7. 


When a notebook contains more cells, we can click “Kernel” → “Restart & Run All” in the
menu bar to run all the cells in the entire notebook. By clicking “Help” → “Edit Keyboard
Shortcuts” in the menu bar, you can edit the shortcuts according to your preferences.


Fig. B.4 Run the cell.


Fig. B.5 The markdown cell after running.


Fig. B.6 Edit the code cell.


Fig. B.7 Run the code cell to obtain the output.


B.1.2 Advanced Options 


Beyond local editing, two things are quite important: editing the notebooks in the markdown
format and running Jupyter remotely. The latter matters when we want to run the code on a
faster server. The former matters since Jupyter’s native ipynb format stores a lot of auxiliary
data that is irrelevant to the content, mostly related to how and where the code is run. This
is confusing for Git, making reviewing contributions very difficult. Fortunately there is an
alternative: native editing in the markdown format.


Markdown Files in Jupyter 


If you wish to contribute to the content of this book, you need to modify the source file (md 
file, not ipynb file) on GitHub. Using the notedown plugin we can modify notebooks in the 
md format directly in Jupyter. 


First, install the notedown plugin, run the Jupyter Notebook, and load the plugin: 


pip install d2l-notedown  # You may need to uninstall the original notedown.
jupyter notebook --NotebookApp.contents_manager_class='notedown.NotedownContentsManager'


You may also turn on the notedown plugin by default whenever you run the Jupyter Note- 
book. First, generate a Jupyter Notebook configuration file (if it has already been generated, 
you can skip this step). 


jupyter notebook --generate-config 


Then, add the following line to the end of the Jupyter Notebook configuration file (for Linux
or macOS, usually in the path ~/.jupyter/jupyter_notebook_config.py):


c.NotebookApp.contents_manager_class = 'notedown.NotedownContentsManager'


After that, you only need to run the jupyter notebook command to turn on the notedown 
plugin by default. 


Running Jupyter Notebooks on a Remote Server 


Sometimes, you may want to run Jupyter notebooks on a remote server and access it through 
a browser on your local computer. If Linux or macOS is installed on your local machine 
(Windows can also support this function through third-party software such as PuTTY), you 
can use port forwarding: 


ssh myserver -L 8888:localhost:8888


The above string myserver is the address of the remote server. Then we can use
http://localhost:8888 to access the remote server myserver that runs Jupyter notebooks. We
will detail how to run Jupyter notebooks on AWS instances later in this appendix.


Timing


We can use the ExecuteTime plugin to time the execution of each code cell in Jupyter
notebooks. Use the following commands to install the plugin:


pip install jupyter_contrib_nbextensions 
jupyter contrib nbextension install --user 
jupyter nbextension enable execute_time/ExecuteTime 
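

If the extension is not available in your environment, a rough alternative is to time a cell by hand with the Python standard library. This is only a minimal sketch: the helper name `timed` is an illustrative assumption and is not part of Jupyter or of the ExecuteTime plugin.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Example: time a simple computation inside a code cell.
total, seconds = timed(sum, range(1_000_000))
print(f"sum = {total}, took {seconds:.4f} s")
```

A single run is noisy, so repeating the measurement a few times and comparing the results gives a more reliable picture.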


B.1.3 Summary 


• Using the Jupyter Notebook tool, we can edit, run, and contribute to each section of the
book.


• We can run Jupyter notebooks on remote servers using port forwarding.


B.1.4 Exercises 


1. Edit and run the code in this book with the Jupyter Notebook on your local machine. 


2. Edit and run the code in this book with the Jupyter Notebook remotely via port forward- 
ing. 


3. Compare the running time of the operations $\mathbf{A}^\top \mathbf{B}$ and $\mathbf{A} \mathbf{B}$ for two square matrices in
$\mathbb{R}^{1024 \times 1024}$. Which one is faster?


Discussions.


B.2 Using Amazon SageMaker


Deep learning applications may demand more computational resources than your local
machine can offer. Cloud computing services allow you to run the GPU-intensive code of
this book more easily using more powerful computers. This section will introduce how to
use Amazon SageMaker to run the code of this book.


B.2.1 Signing Up 


First, we need to sign up for an account at https://aws.amazon.com/. For additional security,
using two-factor authentication is encouraged. It is also a good idea to set up detailed billing
and spending alerts to avoid any surprise, e.g., when forgetting to stop running instances.
After logging into your AWS account, go to your console and search for “Amazon
SageMaker” (see Fig. B.1), then click it to open the SageMaker panel.


Fig. B.1 Search for and open the SageMaker panel.


B.2.2 Creating a SageMaker Instance 


Next, let’s create a notebook instance as described in Fig. B.2. 


Fig. B.2 Create a SageMaker instance.


SageMaker provides multiple instance types with varying computational power and
prices. When creating a notebook instance, we can specify its name and type. In Fig. B.3,
we choose ml.p3.2xlarge: with one Tesla V100 GPU and an 8-core CPU, this instance is
powerful enough for most of the book.


Fig. B.3 Choose the instance type.


The entire book in the ipynb format for running with SageMaker is available at
https://github.com/d2l-ai/d2l-pytorch-sagemaker. We can specify this GitHub repository URL
(Fig. B.4) to allow SageMaker to clone it when creating the instance.


Fig. B.4 Specify the GitHub repository.


B.2.3 Running and Stopping an Instance 


Creating an instance may take a few minutes. When it is ready, click on the “Open Jupyter” 
link next to it (Fig. B.5) so you can edit and run all the Jupyter notebooks of this book on 
this instance (similar to steps in Section B.1). 


Fig. B.5 Open Jupyter on the created SageMaker instance.


After finishing your work, do not forget to stop the instance to avoid being charged further
(Fig. B.6).


Fig. B.6 Stop a SageMaker instance.


B.2.4 Updating Notebooks 


Notebooks of this open-source book will be regularly updated in the
d2l-ai/d2l-pytorch-sagemaker repository on GitHub. To update to the latest version, you may
open a terminal on the SageMaker instance (Fig. B.7).


Fig. B.7 Open a terminal on the SageMaker instance.


You may wish to commit your local changes before pulling updates from the remote repos- 
itory. Otherwise, simply discard all your local changes with the following commands in 
the terminal: 


cd SageMaker/d2l-pytorch-sagemaker/ 
git reset --hard 
git pull 


B.2.5 Summary 


• We can create a notebook instance using Amazon SageMaker to run GPU-intensive code
of this book.


• We can update notebooks via the terminal on the Amazon SageMaker instance.


B.2.6 Exercises 


1. Edit and run any section that requires a GPU using Amazon SageMaker.
2. Open a terminal to access the local directory that hosts all the notebooks of this book.


Discussions.


B.3 Using AWS EC2 Instances 


In this section, we will show you how to install all libraries on a raw Linux machine. Recall
that in Section B.2 we discussed how to use Amazon SageMaker; building an instance
yourself, however, costs less on AWS. The walkthrough includes three steps:


1. Request a GPU Linux instance from AWS EC2.
2. Install CUDA (or use an Amazon Machine Image with preinstalled CUDA).
3. Install the deep learning framework and other libraries for running the code of the book.


This process applies to other instances (and other clouds), too, albeit with some minor
modifications. Before going forward, you need to create an AWS account; see Section B.2
for more details.


B.3.1 Creating and Running an EC2 Instance 
After logging into your AWS account, click “EC2” (Fig. B.1) to go to the EC2 panel.


Fig. B.1 Open the EC2 console.


Fig. B.2 shows the EC2 panel. 


Presetting Location 


Select a nearby data center to reduce latency, e.g., “Oregon” (marked by the red box in the
top-right of Fig. B.2). If you are located in China, you can select a nearby Asia Pacific
region, such as Seoul or Tokyo. Please note that some data centers may not have GPU
instances.


Fig. B.2 The EC2 panel.


Increasing Limits 


Before choosing an instance, check if there are quantity restrictions by clicking the “Lim- 
its” label in the bar on the left as shown in Fig. B.2. Fig. B.3 shows an example of such 
a limitation. The account currently cannot open “p2.xlarge” instances according to the re- 
gion. If you need to open one or more instances, click on the “Request limit increase” link 
to apply for a higher instance quota. Generally, it takes one business day to process an 
application. 


Fig. B.3 Instance quantity restrictions.


Launching an Instance 


Next, click the “Launch Instance” button marked by the red box in Fig. B.2 to launch your 
instance. 


We begin by selecting a suitable Amazon Machine Image (AMI). Select an Ubuntu instance 
(Fig. B.4). 


EC2 provides many different instance configurations to choose from. This can sometimes
feel overwhelming to a beginner. Table B.1 lists different suitable machines.


Fig. B.4 Choose an AMI.


Table B.1 Different EC2 instance types

Name  GPU          Notes
g2    Grid K520    ancient
p2    Kepler K80   old but often cheap as spot
g3    Maxwell M60  good trade-off
p3    Volta V100   high performance for FP16
p4    Ampere A100  high performance for large-scale training
g4    Turing T4    inference optimized FP16/INT8


All these servers come in multiple flavors indicating the number of GPUs used. For example,
a p2.xlarge has 1 GPU and a p2.16xlarge has 16 GPUs and more memory. For more
details, see the AWS EC2 documentation or a summary page. For the purpose of
illustration, a p2.xlarge will suffice (marked in the red box of Fig. B.5).
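The naming scheme can be made concrete with a small lookup. The following is a minimal sketch that only encodes the families listed in Table B.1; the helper name `describe_instance` is hypothetical, and families or variants outside the table simply map to `None`.

```python
# GPU model per EC2 instance family, copied from Table B.1.
GPU_BY_FAMILY = {
    "g2": "Grid K520",
    "p2": "Kepler K80",
    "g3": "Maxwell M60",
    "p3": "Volta V100",
    "p4": "Ampere A100",
    "g4": "Turing T4",
}


def describe_instance(instance_type):
    """Split a name such as 'p2.xlarge' into (family, size, GPU model).

    Hypothetical helper for illustration. A SageMaker-style prefix such as
    'ml.' is ignored by looking only at the last two dot-separated parts.
    """
    parts = instance_type.split(".")
    family, size = parts[-2], parts[-1]
    return family, size, GPU_BY_FAMILY.get(family)


for name in ("p2.xlarge", "p2.16xlarge", "ml.p3.2xlarge"):
    print(name, "->", describe_instance(name))
```

The size suffix (xlarge, 16xlarge, and so on) is returned as-is because the number of GPUs per size is not listed in Table B.1; consult the AWS documentation for those values.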


Fig. B.5 Choose an instance. The screenshot shows p2.xlarge with 4 vCPUs, 61 GiB of memory,
and on-demand Linux pricing of 0.9 USD per hour.


Note that you should use a GPU-enabled instance with suitable drivers and a GPU-enabled
deep learning framework. Otherwise you will not see any benefit from using GPUs.


We go on to select the key pair used to access the instance. If you do not have a key pair, 
click “Create new key pair” in Fig. B.6 to generate a key pair. Subsequently, you can select 
the previously generated key pair. Make sure that you download the key pair and store 
it in a safe location if you generated a new one. This is your only way to SSH into the 
server. 


Fig. B.6 Select a key pair.


In this example, we will keep the default configurations for “Network settings” (click the 
“Edit” button to configure items such as the subnet and security groups). We just increase 
the default hard disk size to 64 GB (Fig. B.7). Note that CUDA by itself already takes up 
4 GB. 


Fig. B.7 Modify the hard disk size.


Click “Launch Instance” to launch the created instance. Click the instance ID shown in Fig. 
B.8 to view the status of this instance. 


Connecting to the Instance 


As shown in Fig. B.9, after the instance state turns green, right-click the instance and select 
Connect to view the instance access method. 


Fig. B.8 Click the instance ID.


Fig. B.9 View the instance access method.


If this is a new key, it must not be publicly viewable for SSH to work. Go to the folder where
you store D2L_key.pem and execute the following command to make the key not publicly
viewable:


chmod 400 D2L_key.pem 
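To see what mode 400 means, the following Python sketch applies the same permission bits to a throwaway file and inspects them. It assumes a POSIX system (Linux or macOS); the temporary file is only a stand-in for D2L_key.pem.

```python
import os
import stat
import tempfile

# Create a throwaway file standing in for D2L_key.pem (hypothetical stand-in).
fd, key_path = tempfile.mkstemp(suffix=".pem")
os.close(fd)

# Same permission bits as running: chmod 400 <file>
os.chmod(key_path, 0o400)
mode = stat.S_IMODE(os.stat(key_path).st_mode)

# Mode 400 = owner may read; group and others have no access at all.
owner_can_read = bool(mode & stat.S_IRUSR)
owner_can_write = bool(mode & stat.S_IWUSR)
others_have_access = bool(mode & (stat.S_IRWXG | stat.S_IRWXO))
print(oct(mode), owner_can_read, owner_can_write, others_have_access)

os.remove(key_path)
```

An SSH client typically refuses a private key that group or other users can read, which is why the permission change is needed for a freshly downloaded key.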


Fig. B.10 View instance access and startup method.


Now, copy the SSH command in the lower red box of Fig. B.10 and paste it into the command
line:


ssh -i "D2L_key.pem" ubuntu@ec2-xx-xxx-xxx-xxx.y.compute.amazonaws.com


When the command line prompts “Are you sure you want to continue connecting (yes/no)”, 
enter “yes” and press Enter to log into the instance. 


Your server is ready now. 


B.3.2 Installing CUDA 


Before installing CUDA, be sure to update the instance with the latest drivers. 


sudo apt-get update && sudo apt-get install -y build-essential git libgfortran3 


Here we download CUDA 12.1. Visit NVIDIA’s official repository to find the download
link as shown in Fig. B.11.


Fig. B.11 Find the CUDA 12.1 download address.


Copy the instructions and paste them onto the terminal to install CUDA 12.1. 


# The link and file name are subject to changes
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda-repo-ubuntu2204-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda


After installing the program, run the following command to view the GPUs: 


nvidia-smi 


Finally, add CUDA to the library path to help other libraries find it, for example by
appending the following lines to the end of ~/.bashrc.


export PATH="/usr/local/cuda-12.1/bin:$PATH"
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-12.1/lib64
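Why the search path matters can be illustrated without a real CUDA installation. The sketch below is an assumption-laden stand-in: it creates a temporary directory with an empty fake `nvcc` executable (playing the role of /usr/local/cuda-12.1/bin) and shows that the tool is only found once that directory is part of the search path. It assumes a POSIX system.

```python
import os
import shutil
import stat
import tempfile

# Stand-in for /usr/local/cuda-12.1/bin containing a fake, empty 'nvcc'.
fake_cuda_bin = tempfile.mkdtemp()
fake_nvcc = os.path.join(fake_cuda_bin, "nvcc")
with open(fake_nvcc, "w") as f:
    f.write("#!/bin/sh\n")
os.chmod(fake_nvcc, stat.S_IRWXU)  # make it executable for the owner

# A search path without the CUDA directory, and one with it prepended,
# mirroring: export PATH="<cuda bin>:$PATH"
path_without_cuda = "/nonexistent-dir-for-illustration"
path_with_cuda = fake_cuda_bin + os.pathsep + path_without_cuda

missing = shutil.which("nvcc", path=path_without_cuda)
found = shutil.which("nvcc", path=path_with_cuda)
print("without CUDA dir:", missing)
print("with CUDA dir:   ", found)

shutil.rmtree(fake_cuda_bin)
```

On the real instance, running `which nvcc` in a new shell after editing ~/.bashrc performs the same lookup against the actual PATH.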


B.3.3 Installing Libraries for Running the Code 


To run the code of this book, follow the steps in Installation (page xxxiv) for Linux users on
the EC2 instance and use the following tips for working on a remote Linux server:


• To download the bash script on the Miniconda installation page, right click the download
link and select “Copy Link Address”, then execute wget [copied link address].


• After running ~/miniconda3/bin/conda init, you may execute source ~/.bashrc
instead of closing and reopening your current shell.


B.3.4 Running the Jupyter Notebook Remotely


To run the Jupyter Notebook remotely you need to use SSH port forwarding. After all, the 
server in the cloud does not have a monitor or keyboard. For this, log into your server from 
your desktop (or laptop) as follows: 


# This command must be run in the local command line
ssh -i "/path/to/key.pem" ubuntu@ec2-xx-xxx-xxx-xxx.y.compute.amazonaws.com -L 8889:localhost:8888


Next, go to the location of the downloaded code of this book on the EC2 instance, then 
run: 


conda activate d2l
jupyter notebook 


Fig. B.12 shows the possible output after you run the Jupyter Notebook. The last row is the 
URL for port 8888. 


Since you used port forwarding to port 8889, copy the last row in the red box of Fig. B.12, 
replace “8888” with “8889” in the URL, and open it in your local browser. 
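The port substitution can also be done programmatically. The following is a minimal sketch using only the Python standard library; the function name `forward_url` and the example URL with its placeholder token are hypothetical, and the real URL printed by Jupyter on your instance will differ.

```python
from urllib.parse import urlsplit, urlunsplit


def forward_url(remote_url, local_port=8889):
    """Rewrite the port of a Jupyter URL to the locally forwarded port."""
    parts = urlsplit(remote_url)
    # Keep the host name, swap only the port (8888 on the server side).
    netloc = f"{parts.hostname}:{local_port}"
    return urlunsplit(parts._replace(netloc=netloc))


# Hypothetical URL of the kind Jupyter prints; the token is a placeholder.
printed_by_jupyter = "http://localhost:8888/?token=PLACEHOLDER"
print(forward_url(printed_by_jupyter))
```

Everything after the port, including the access token, is kept unchanged, which is what the manual edit in the browser address bar achieves as well.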
Fig. B.12 Output after running the Jupyter Notebook. The last row is the URL for port 8888.


B.3.5 Closing Unused Instances 


As cloud services are billed by the time of use, you should close instances that are not being 
used. Note that there are alternatives: 


• “Stopping” an instance means that you will be able to start it again. This is akin to
switching off the power for your regular server. However, stopped instances will still
be billed a small amount for the hard disk space retained.


• “Terminating” an instance will delete all data associated with it. This includes the disk,
hence you cannot start it again. Only do this if you know that you will not need it in
the future.
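The billing difference between the three states can be sketched with simple arithmetic. In the sketch below, the hourly rate of 0.9 USD is the on-demand Linux price shown for p2.xlarge in this section's instance-type screenshot, and 64 GB is the disk size chosen earlier; the storage rate of 0.10 USD per GB-month is a hypothetical placeholder, so check current AWS pricing before relying on any of these numbers.

```python
HOURS_PER_MONTH = 24 * 30  # simplification: a 30-day month


def monthly_cost_usd(state, hourly_rate, disk_gb, storage_rate_per_gb_month):
    """Rough monthly bill for one instance in a given state (illustrative)."""
    storage = disk_gb * storage_rate_per_gb_month
    if state == "running":
        return HOURS_PER_MONTH * hourly_rate + storage
    if state == "stopped":
        return storage  # compute is no longer billed, the disk still is
    if state == "terminated":
        return 0.0  # the disk is deleted together with the instance
    raise ValueError(f"unknown state: {state}")


# 0.9 USD/hour and 64 GB come from this section; 0.10 is a placeholder rate.
for state in ("running", "stopped", "terminated"):
    cost = monthly_cost_usd(state, hourly_rate=0.9, disk_gb=64,
                            storage_rate_per_gb_month=0.10)
    print(f"{state:>10}: {cost:.2f} USD per month")
```

Even with placeholder rates, the gap between a running and a stopped instance is about two orders of magnitude, which is why stopping unused instances matters.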


If you want to use the instance as a template for many more instances, right-click on the
example in Fig. B.9 and select “Image” → “Create” to create an image of the instance. Once
this is complete, select “Instance State” → “Terminate” to terminate the instance. The next
time you want to use this instance, you can follow the steps in this section to create an 
instance based on the saved image. The only difference is that, in “1. Choose AMI” shown 
in Fig. B.4, you must use the “My AMIs” option on the left to select your saved image. The 
created instance will retain the information stored on the image hard disk. For example, 
you will not have to reinstall CUDA and other runtime environments. 


B.3.6 Summary 


• We can launch and stop instances on demand without having to buy and build our own
computer.


• We need to install CUDA before using the GPU-enabled deep learning framework.


• We can use port forwarding to run the Jupyter Notebook on a remote server.


B.3.7 Exercises 


1. The cloud offers convenience, but it does not come cheap. Find out how to launch spot
instances to see how to reduce costs.
2. Experiment with different GPU servers. How fast are they?
3. Experiment with multi-GPU servers. How well can you scale things up?


Discussions.


B.4 Using Google Colab 


We introduced how to run this book on AWS in Section B.2 and Section B.3. Another
option is running this book on Google Colab if you have a Google account.


To run the code of a section on Colab, simply click the Colab button as shown in Fig. B.1.


Fig. B.1 Run the code of a section on Colab.


If this is your first time running a code cell, you will receive a warning message as shown in
Fig. B.2. Just click “RUN ANYWAY” to ignore it.


Fig. B.2 Ignore the warning message by clicking “RUN ANYWAY”.


Next, Colab will connect you to an instance to run the code of this section. Specifically,
if a GPU is needed, Colab will automatically be asked to connect to a GPU instance.


B.4.1 Summary 
• You can use Google Colab to run each section’s code in this book.


• Colab will automatically be asked to connect to a GPU instance if a GPU is needed in any
section of this book.


B.4.2 Exercises


1. Open any section of this book using Google Colab. 


2. Edit and run any section that requires a GPU using Google Colab. 


Discussions.


B.5 Selecting Servers and GPUs 


Deep learning training generally requires large amounts of computation. At present GPUs 
are the most cost-effective hardware accelerators for deep learning. In particular, compared 
with CPUs, GPUs are cheaper and offer higher performance, often by over an order of 
magnitude. Furthermore, a single server can support multiple GPUs, up to 8 for high end 
servers. More typical numbers are up to 4 GPUs for an engineering workstation, since 
heat, cooling, and power requirements escalate quickly beyond what an office building can 
support. For larger deployments, cloud computing (e.g., Amazon’s P3 and G4
instances) is a much more practical solution.


B.5.1 Selecting Servers 


There is typically no need to purchase high-end CPUs with many threads since much of
the computation occurs on the GPUs. That said, due to the global interpreter lock (GIL)
in Python, single-thread performance of a CPU can matter in situations where we have 4-8
GPUs. All else being equal, this suggests that CPUs with a smaller number of cores but a
higher clock frequency might be a more economical choice. For example, when choosing
between a 6-core 4 GHz and an 8-core 3.5 GHz CPU, the former is much preferable, even
though its aggregate speed (6 × 4 = 24 GHz versus 8 × 3.5 = 28 GHz) is less. An important
consideration is that GPUs use lots of power and thus dissipate lots of heat. This requires
very good cooling and a large enough chassis to house the GPUs. Follow the guidelines
below if possible:


1. Power Supply. GPUs use significant amounts of power. Budget with up to 350W per 
device (check for the peak demand of the graphics card rather than typical demand, since 
efficient code can use lots of energy). If your power supply is not up to the demand you 
will find that your system becomes unstable. 


2. Chassis Size. GPUs are large and the auxiliary power connectors often need extra space. 
Also, large chassis are easier to cool. 


3. GPU Cooling. If you have a large number of GPUs you might want to invest in water
cooling. Also, aim for reference designs even if they have fewer fans, since they are
thin enough to allow for air intake between the devices. If you buy a multi-fan GPU it
might be too thick to get enough air when installing multiple GPUs and you will run
into thermal throttling.


4. PCIe Slots. Moving data to and from the GPU (and exchanging it between GPUs)
requires lots of bandwidth. We recommend PCIe 3.0 slots with 16 lanes. If you mount 
multiple GPUs, be sure to carefully read the motherboard description to ensure that 16x 
bandwidth is still available when multiple GPUs are used at the same time and that you 
are getting PCIe 3.0 as opposed to PCIe 2.0 for the additional slots. Some motherboards 
downgrade to 8x or even 4x bandwidth with multiple GPUs installed. This is partly due 
to the number of PCIe lanes that the CPU offers. 
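To get a feel for what a downgrade from 16 to 8 or 4 lanes costs, the theoretical link bandwidth can be computed from the per-lane signaling rate and line encoding. The rates and encodings below are the standard published PCIe 2.0 and 3.0 figures, not numbers taken from this book, and the result ignores protocol overhead, so achievable throughput is lower.

```python
# Per-lane raw rate in GT/s and line-encoding efficiency per PCIe generation.
PCIE = {
    "2.0": (5.0, 8 / 10),     # 8b/10b encoding
    "3.0": (8.0, 128 / 130),  # 128b/130b encoding
}


def bandwidth_gb_per_s(generation, lanes):
    """Theoretical one-directional bandwidth in GB/s (overhead ignored)."""
    rate_gt, efficiency = PCIE[generation]
    return rate_gt * efficiency * lanes / 8  # 8 bits per byte


for gen, lanes in [("3.0", 16), ("3.0", 8), ("3.0", 4), ("2.0", 16)]:
    print(f"PCIe {gen} x{lanes}: {bandwidth_gb_per_s(gen, lanes):.2f} GB/s")
```

Under these figures, a PCIe 3.0 slot running at 8 lanes offers roughly the bandwidth of a PCIe 2.0 slot at 16 lanes, so either downgrade about halves the data rate between host and GPU.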


In short, here are some recommendations for building a deep learning server: 


• Beginner. Buy a low-end GPU with low power consumption (cheap gaming GPUs suitable
for deep learning use 150-200W). If you are lucky your current computer supports it.


• 1 GPU. A low-end CPU with 4 cores will be sufficient and most motherboards suffice.
Aim for at least 32 GB DRAM and invest into an SSD for local data access. A power
supply with 600W should be sufficient. Buy a GPU with lots of fans.


• 2 GPUs. A low-end CPU with 4-6 cores will suffice. Aim for 64 GB DRAM and invest
into an SSD. You will need in the order of 1000W for two high-end GPUs. In terms
of mainboards, make sure that they have two PCIe 3.0 x16 slots. If you can, get a
mainboard that has two free spaces (60mm spacing) between the PCIe 3.0 x16 slots
for extra air. In this case, buy two GPUs with lots of fans.


• 4 GPUs. Make sure that you buy a CPU with relatively fast single-thread speed (i.e., high
clock frequency). You will probably need a CPU with a larger number of PCIe lanes,
such as an AMD Threadripper. You will likely need relatively expensive mainboards
to get 4 PCIe 3.0 x16 slots since they probably need a PLX to multiplex the PCIe lanes.
Buy GPUs with reference design that are narrow and let air in between the GPUs. You
need a 1600-2000W power supply and the outlet in your office might not support that.
This server will probably run loud and hot. You do not want it under your desk. 128
GB of DRAM is recommended. Get an SSD (1-2 TB NVMe) for local storage and a
bunch of hard disks in RAID configuration to store your data.


• 8 GPUs. You need to buy a dedicated multi-GPU server chassis with multiple redundant
power supplies (e.g., 2+1 for 1600W per power supply). This will require dual socket
server CPUs, 256 GB ECC DRAM, a fast network card (10 GbE recommended),
and you will need to check whether the servers support the physical form factor of
the GPUs. Airflow and wiring placement differ significantly between consumer and
server GPUs (e.g., RTX 2080 vs. Tesla V100). This means that you might not be able
to install the consumer GPU in a server due to insufficient clearance for the power cable
or lack of a suitable wiring harness (as one of the coauthors painfully discovered).
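The power supply figures above can be roughly reproduced with a back-of-the-envelope budget. The sketch below uses the 350W per-GPU peak suggested earlier in this section; the 250W allowance for the rest of the system (CPU, mainboard, drives, fans) is a hypothetical value chosen for illustration, not a number from the text.

```python
GPU_PEAK_WATTS = 350     # per-device peak budget suggested in this section
BASE_SYSTEM_WATTS = 250  # hypothetical allowance for CPU, board, drives, fans


def psu_budget_watts(num_gpus, gpu_watts=GPU_PEAK_WATTS,
                     base_watts=BASE_SYSTEM_WATTS):
    """Rough minimum power supply rating for a workstation with num_gpus GPUs."""
    return num_gpus * gpu_watts + base_watts


for n in (1, 2, 4):
    print(f"{n} GPU(s): about {psu_budget_watts(n)} W")
```

Under these assumptions the estimates land close to the recommended 600W, 1000W, and 1600-2000W supplies; adding headroom on top is prudent since an undersized supply makes the system unstable.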


B.5.2 Selecting GPUs 


At present, AMD and NVIDIA are the two main manufacturers of dedicated GPUs. NVIDIA
was the first to enter the deep learning field and provides better support for deep learning
frameworks via CUDA. Therefore, most buyers choose NVIDIA GPUs.


NVIDIA provides two types of GPUs, targeting individual users (e.g., via the GTX and
RTX series) and enterprise users (via its Tesla series). The two types of GPUs provide 
comparable compute power. However, the enterprise user GPUs generally use (passive) 
forced cooling, more memory, and ECC (error correcting) memory. These GPUs are more 
suitable for data centers and usually cost ten times more than consumer GPUs. 


If you are a large company with 100+ servers you should consider the NVIDIA Tesla series 
or alternatively use GPU servers in the cloud. For a lab or a small to medium company with 
10+ servers the NVIDIA RTX series is likely most cost effective. You can buy preconfig- 
ured servers with Supermicro or Asus chassis that hold 4-8 GPUs efficiently. 


GPU vendors typically release a new generation every one to two years, such as the GTX
1000 (Pascal) series released in 2016 and the RTX 2000 (Turing) series released in 2018.
Each series offers several different models that provide different performance levels. GPU 
performance is primarily a combination of the following three parameters: 


1. Compute Power. Generally we look for 32-bit floating-point compute power. 16-bit 
floating point training (FP16) is also entering the mainstream. If you are only interested 
in prediction, you can also use 8-bit integer. The latest generation of Turing GPUs offers 
4-bit acceleration. Unfortunately at the time of writing the algorithms for training low- 
precision networks are not yet widespread. 


2. Memory Size. As your models become larger or the batches used during training grow 
bigger, you will need more GPU memory. Check for HBM2 (High Bandwidth Memory) 
vs. GDDR6 (Graphics DDR) memory. HBM2 is faster but much more expensive. 


3. Memory Bandwidth. You can only get the most out of your compute power when you 
have sufficient memory bandwidth. Look for wide memory buses if using GDDR6. 


For most users, it is enough to look at compute power. Note that many GPUs offer different 
types of acceleration. For example, NVIDIA’s TensorCores accelerate a subset of opera- 
tors by 5x. Ensure that your libraries support this. The GPU memory should be no less 
than 4 GB (8 GB is much better). Try to avoid using the GPU also for displaying a GUI 
(use the built-in graphics instead). If you cannot avoid it, add an extra 2 GB of RAM for 
safety. 
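How precision affects memory needs follows directly from the number of bytes per stored value. The sketch below estimates the memory taken by model weights alone; the function name and the 100-million-parameter example are hypothetical, and gradients, optimizer state, and activations add to this during training.

```python
# Bytes needed to store one value at each precision.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_gb(num_parameters, dtype="fp32"):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * BYTES_PER_VALUE[dtype] / 1e9


# Hypothetical model size chosen for illustration.
num_parameters = 100_000_000
for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: {weights_gb(num_parameters, dtype):.2f} GB")
```

Halving the precision halves the storage per value, which is one reason why lower-precision formats allow larger models or batches on the same GPU.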


Fig. B.1 compares the 32-bit floating-point compute power and price of the various GTX 
900, GTX 1000 and RTX 2000 series models. The prices suggested are those found on 
Wikipedia at the time of writing. 


We can see a number of things: 


1. Within each series, price and performance are roughly proportional. Titan models
command a significant premium for the benefit of larger amounts of GPU memory. However,
the newer models offer better cost effectiveness, as can be seen by comparing the
980 Ti and 1080 Ti. The price does not appear to improve much for the RTX 2000 series.
However, this is due to the fact that they offer far superior low precision performance
(FP16, INT8, and INT4).


Fig. B.1 Floating-point compute power and price comparison.


2. The performance-to-cost ratio of the GTX 1000 series is about two times greater than
that of the 900 series.
3. For the RTX 2000 series the performance (in GFLOPs) is an affine function of the price. 
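The performance-to-cost comparison between generations amounts to dividing compute power by price. The sketch below does this for the 980 Ti and 1080 Ti; the throughput and price values are approximate base-clock FP32 figures and launch prices as commonly listed in public spec tables, and should be treated as assumptions for illustration rather than values read from Fig. B.1.

```python
# (approximate FP32 GFLOPs at base clock, approximate launch price in USD).
# Assumed values for illustration; substitute the numbers you trust.
GPUS = {
    "GTX 980 Ti": (5632, 649),
    "GTX 1080 Ti": (10609, 699),
}


def gflops_per_dollar(name):
    """Performance-to-cost ratio of one GPU model."""
    gflops, price = GPUS[name]
    return gflops / price


old = gflops_per_dollar("GTX 980 Ti")
new = gflops_per_dollar("GTX 1080 Ti")
ratio = new / old
print(f"980 Ti:  {old:.1f} GFLOPs per USD")
print(f"1080 Ti: {new:.1f} GFLOPs per USD")
print(f"improvement: {ratio:.2f}x")
```

With these assumed inputs the newer card delivers somewhat under twice the compute per dollar, in line with the rough factor of two stated above.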
Fig. B.2 Floating-point compute power and energy consumption.


Fig. B.2 shows two things. First, energy consumption scales mostly linearly with the amount
of computation. Second, later generations are more efficient. This seems to be contradicted
by the graph corresponding to the RTX 2000 series. However, this is a consequence of the
TensorCores, which draw a disproportionate amount of energy.


B.5.3 Summary 


• Watch out for power, PCIe bus lanes, CPU single thread speed, and cooling when building
a server.


• You should purchase the latest GPU generation if possible.


• Use the cloud for large deployments.


• High density servers may not be compatible with all GPUs. Check the mechanical and
cooling specifications before you buy.


• Use FP16 or lower precision for high efficiency.


Discussions.


B.6 Contributing to This Book 


Contributions by readers help us improve this book. If you find a typo, an outdated link,
something where you think we missed a citation, where the code does not look elegant or
where an explanation is unclear, please contribute back and help us help our readers. While
in regular books the delay between print runs (and thus between typo corrections) can be
measured in years, it typically takes hours to days to incorporate an improvement in this
book. This is all possible due to version control and continuous integration (CI) testing. To
contribute, you need to submit a pull request to the GitHub repository. When your pull
request is merged into the code repository by the authors, you will become a contributor.


B.6.1 Submitting Minor Changes 


The most common contributions are editing one sentence or fixing typos. We recommend
that you find the source file in the GitHub repository and edit the file directly. For
example, you can search for the file through the Find file button (Fig. B.1) to locate the
source file (a markdown file). Then you click the “Edit this file” button on the upper-right
corner to make your changes in the markdown file.


After you are done, fill in your change descriptions in the “Propose file change” panel at
the bottom of the page and then click the “Propose file change” button. It will redirect you
to a new page to review your changes (Fig. B.7). If everything is good, you can submit a pull
request by clicking the “Create pull request” button.


Fig. B.1 Edit the file on GitHub.


B.6.2 Proposing Major Changes 


If you plan to update a large portion of text or code, then you need to know a little bit more
about the format this book is using. The source file is based on the markdown format
with a set of extensions through the D2L-Book package such as referring to equations,
images, chapters, and citations. You can use any markdown editor to open these files and
make your changes.


If you would like to change the code, we recommend that you use the Jupyter Notebook to 
open these markdown files as described in Section B.1, so that you can run and test your 
changes. Please remember to clear all outputs before submitting your changes since our CI 
system will execute the sections you updated to generate outputs. 


Some sections may support multiple framework implementations. If you add a new code
block, please use %%tab to mark this block on the beginning line. For example, %%tab
pytorch for a PyTorch code block, %%tab tensorflow for a TensorFlow code block, or
%%tab all for a shared code block for all implementations. You may refer to the d2lbook
package for more information.
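How such a marker selects code for one framework can be sketched in a few lines. The function below is a simplified illustration, not the d2lbook implementation: the name `block_applies` is hypothetical, and treating unmarked blocks as shared is an assumption.

```python
def block_applies(code_block, framework):
    """Return True if a code block is marked for `framework` or for all.

    Simplified illustration only; this is not how d2lbook itself is written.
    """
    lines = code_block.strip().splitlines()
    if not lines or not lines[0].startswith("%%tab"):
        return True  # assumption: unmarked blocks are shared
    targets = lines[0][len("%%tab"):].replace(",", " ").split()
    return "all" in targets or framework in targets


blocks = [
    "%%tab pytorch\nimport torch",
    "%%tab tensorflow\nimport tensorflow as tf",
    "%%tab all\nbatch_size = 32",
]
selected = [b for b in blocks if block_applies(b, "pytorch")]
print(len(selected), "of", len(blocks), "blocks apply to pytorch")
```

For the authoritative behavior of the marker, refer to the d2lbook package itself.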


B.6.3 Submitting Major Changes 


We suggest that you use the standard Git process to submit a major change. In a nutshell,
the process works as described in Fig. B.2.


Fig. B.2 Contributing to the book.


We will walk you through the steps in detail. If you are already familiar with Git you
can skip this section. For concreteness we assume that the contributor’s user name is
“astonzhang”.


Installing Git


The Git open-source book describes how to install Git. This typically works via apt
install git on Ubuntu Linux, by installing the Xcode developer tools on macOS, or by
using GitHub’s desktop client. If you do not have a GitHub account, you need to sign
up for one.


Logging in to GitHub 


Enter the address of the book's code repository in your browser. Click on the Fork
button in the red box at the upper-right of Fig. B.3 to make a copy of the repository of this
book. This is now your copy and you can change it any way you want.


Fig. B.3 The code repository page.


Now, the code repository of this book will be forked (i.e., copied) to your username, such
as astonzhang/d2l-en shown at the upper-left of Fig. B.4.


Fig. B.4 The forked code repository.


Cloning the Repository 


To clone the repository (i.e., to make a local copy) we need to get its repository address.
The green button in Fig. B.5 displays this. Make sure that your local copy is up to date
with the main repository if you decide to keep this fork around for longer. For now simply
follow the instructions in Installation (page xxxiv) to get started. The main difference is
that you are now downloading your own fork of the repository.


Fig. B.5 Cloning the repository.


# Replace your_github_username with your GitHub username
git clone https://github.com/your_github_username/d2l-en.git


Editing and Pushing 


Now it is time to edit the book. It is best to edit it in the Jupyter Notebook following the
instructions in Section B.1. Make the changes and check that they are OK. Assume that we have
fixed a typo in the file ~/d2l-en/chapter_appendix-tools-for-deep-learning/contributing.md.
You can then check which files you have changed.


At this point Git will report that the chapter_appendix-tools-for-deep-learning/contributing.md
file has been modified.


mylaptop:d2l-en me$ git status
On branch master
Your branch is up-to-date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   chapter_appendix-tools-for-deep-learning/contributing.md


After confirming that this is what you want, execute the following command: 


git add chapter_appendix-tools-for-deep-learning/contributing.md 
git commit -m 'Fix a typo in git documentation'
git push 


The changed code will then be in your personal fork of the repository. To request the 
addition of your change, you have to create a pull request for the official repository of the 
book. 


Submitting Pull Requests 


As shown in Fig. B.6, go to your fork of the repository on GitHub and select “New pull 
request”. This will open up a screen that shows you the changes between your edits and 
what is current in the main repository of the book. 


Fig. B.6 New pull request.


Finally, submit a pull request by clicking the button as shown in Fig. B.7. Make sure to
describe the changes you have made in the pull request. This will make it easier for the 
authors to review it and to merge it with the book. Depending on the changes, this might 
get accepted right away, rejected, or more likely, you will get some feedback on the changes. 
Once you have incorporated them, you are good to go. 


Fig. B.7 Create pull request.


B.6.4 Summary 
• You can use GitHub to contribute to this book.
• You can edit the file on GitHub directly for minor changes.
• For a major change, please fork the repository, edit things locally, and only contribute
  back once you are ready.
• Pull requests are how contributions are bundled up. Try not to submit huge pull
  requests since this makes them hard to understand and incorporate. Better send several
  smaller ones.


B.6.5 Exercises 
1. Star and fork the d2l-ai/d2l-en repository.


2. If you spot anything that needs improvement (e.g., missing a reference), submit a pull 
request. 


3. It is usually a better practice to create a pull request using a new branch. Learn how to
do it with Git branching.


Discussions.


B.7 Utility Functions and Classes 
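

This section contains the implementations of utility functions and classes used in this
book.


import collections
import inspect
from IPython import display
from torch import nn
from d2l import torch as d2l
# Assumed imports: the listings below also use these names without importing them
import numpy as np
import torch
import torchvision
from matplotlib import pyplot as plt
from torchvision import transforms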


Hyperparameters. 


@d2l.add_to_class(d2l.HyperParameters)  #@save
def save_hyperparameters(self, ignore=[]):
    """Save function arguments into class attributes."""
    frame = inspect.currentframe().f_back
    _, _, _, local_vars = inspect.getargvalues(frame)
    self.hparams = {k: v for k, v in local_vars.items()
                    if k not in set(ignore+['self']) and not k.startswith('_')}
    for k, v in self.hparams.items():
        setattr(self, k, v)
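To see the frame-inspection trick in isolation, here is a minimal sketch that does not depend on the d2l package. The class names HyperParametersSketch and ToyModel are hypothetical stand-ins introduced only for this illustration. The method looks one frame up the call stack, so it is meant to be called directly from a constructor.

```python
import inspect


class HyperParametersSketch:
    """Hypothetical stand-in for d2l.HyperParameters (not the library class)."""

    def save_hyperparameters(self, ignore=()):
        # Look one frame up: the caller is expected to be __init__
        frame = inspect.currentframe().f_back
        _, _, _, local_vars = inspect.getargvalues(frame)
        skip = set(ignore) | {'self'}
        self.hparams = {k: v for k, v in local_vars.items()
                        if k not in skip and not k.startswith('_')}
        for k, v in self.hparams.items():
            setattr(self, k, v)


class ToyModel(HyperParametersSketch):
    def __init__(self, lr, num_hiddens=16, debug=False):
        # Every constructor argument except `debug` becomes an attribute
        self.save_hyperparameters(ignore=['debug'])


model = ToyModel(lr=0.1)
print(model.hparams)
```

Because the arguments are read from the caller's local variables, any local that is defined in the constructor before the call would be saved as well, so it is safest to call the method on the first line of the constructor.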


Progress bar. 


@d2l.add_to_class(d2l.ProgressBoard)  #@save
def draw(self, x, y, label, every_n=1):
    Point = collections.namedtuple('Point', ['x', 'y'])
    if not hasattr(self, 'raw_points'):
        self.raw_points = collections.OrderedDict()
        self.data = collections.OrderedDict()
    if label not in self.raw_points:
        self.raw_points[label] = []
        self.data[label] = []
    points = self.raw_points[label]
    line = self.data[label]
    points.append(Point(x, y))
    if len(points) != every_n:
        return
    mean = lambda x: sum(x) / len(x)
    line.append(Point(mean([p.x for p in points]),
                      mean([p.y for p in points])))
    points.clear()
    if not self.display:
        return
    d2l.use_svg_display()
    if self.fig is None:
        self.fig = d2l.plt.figure(figsize=self.figsize)
    plt_lines, labels = [], []
    for (k, v), ls, color in zip(self.data.items(), self.ls, self.colors):
        plt_lines.append(d2l.plt.plot([p.x for p in v], [p.y for p in v],
                                      linestyle=ls, color=color)[0])
        labels.append(k)
    axes = self.axes if self.axes else d2l.plt.gca()
    if self.xlim: axes.set_xlim(self.xlim)
    if self.ylim: axes.set_ylim(self.ylim)
    if not self.xlabel: self.xlabel = self.x
    axes.set_xlabel(self.xlabel)
    axes.set_ylabel(self.ylabel)
    axes.set_xscale(self.xscale)
    axes.set_yscale(self.yscale)
    axes.legend(plt_lines, labels)
    display.display(self.fig)
    display.clear_output(wait=True)
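The smoothing that draw applies through every_n can be studied without any plotting. The following sketch, with the hypothetical helper name smooth, buffers raw points and emits one averaged point each time the buffer reaches every_n entries, which is our reading of the logic above.

```python
def smooth(values, every_n):
    """Average consecutive groups of `every_n` (x, y) points."""
    buffer, line = [], []
    for x, y in values:
        buffer.append((x, y))
        if len(buffer) != every_n:
            continue  # Keep buffering until the group is full
        mean_x = sum(p[0] for p in buffer) / len(buffer)
        mean_y = sum(p[1] for p in buffer) / len(buffer)
        line.append((mean_x, mean_y))
        buffer.clear()
    return line


points = [(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0), (4, 9.0)]
smoothed = smooth(points, every_n=2)
print(smoothed)
```

Note that a trailing group with fewer than every_n points stays in the buffer and is not plotted until it fills up.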


Add FrozenLake environment


def frozen_lake(seed):  #@save
    # See https://www.gymlibrary.dev/environments/toy_text/frozen_lake/ to
    # learn more about this env
    # How to process env.P.items is adapted from
    # https://sites.google.com/view/deep-rl-bootcamp/labs
    import gym

    env = gym.make('FrozenLake-v1', is_slippery=False)
    env.seed(seed)
    env.action_space.np_random.seed(seed)
    env.action_space.seed(seed)
    env_info = {}
    env_info['desc'] = env.desc  # 2D array specifying what each grid item means
    env_info['num_states'] = env.nS  # Number of observations/states or obs/state dim
    env_info['num_actions'] = env.nA  # Number of actions or action dim
    # Define indices for (transition probability, nextstate, reward, done) tuple
    env_info['trans_prob_idx'] = 0  # Index of transition probability entry
    env_info['nextstate_idx'] = 1  # Index of next state entry
    env_info['reward_idx'] = 2  # Index of reward entry
    env_info['done_idx'] = 3  # Index of done entry
    env_info['mdp'] = {}
    env_info['env'] = env

    for (s, others) in env.P.items():
        # others(s) = {a0: [(p(s'|s,a0), s', reward, done),...], a1: [...], ...}

        for (a, pxrds) in others.items():
            # pxrds is [(p1,next1,r1,d1),(p2,next2,r2,d2),..].
            # e.g. [(0.3, 0, 0, False), (0.3, 0, 0, False), (0.3, 4, 1, False)]
            env_info['mdp'][(s,a)] = pxrds

    return env_info
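To make the layout of the mdp dictionary concrete, the sketch below runs a few sweeps of value iteration on a hand-written two-state MDP stored in the same format: each (state, action) key maps to a list of (probability, next state, reward, done) tuples. The toy transitions, the discount factor, and the choice to drop the bootstrap term on terminal transitions are assumptions made for illustration, not taken from the gym environment.

```python
# Toy MDP in the same layout as env_info['mdp']:
# (state, action) -> list of (prob, next_state, reward, done)
mdp = {
    (0, 0): [(1.0, 0, 0.0, False)],  # Stay in state 0, no reward
    (0, 1): [(1.0, 1, 1.0, True)],   # Move to the goal, reward 1
    (1, 0): [(1.0, 1, 0.0, True)],   # The goal state is absorbing
    (1, 1): [(1.0, 1, 0.0, True)],
}
P, NEXT, R, DONE = 0, 1, 2, 3  # Same roles as trans_prob_idx, nextstate_idx, ...


def q_value(mdp, V, s, a, gamma=0.9):
    """Expected return of taking action a in state s, then following V."""
    return sum(t[P] * (t[R] + gamma * V[t[NEXT]] * (not t[DONE]))
               for t in mdp[(s, a)])


V = [0.0, 0.0]
for _ in range(50):  # A few sweeps of value iteration
    V = [max(q_value(mdp, V, s, a) for a in (0, 1)) for s in (0, 1)]
print(V)
```

With the real environment, the same loop would range over env_info['num_states'] and env_info['num_actions'] instead of the hard-coded pairs.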


Create environment


def make_env(name='', seed=0):  #@save
    # Input parameters:
    # name: specifies a gym environment.
    # For Value iteration, only FrozenLake-v1 is supported.
    if name == 'FrozenLake-v1':
        return frozen_lake(seed)
    else:
        raise ValueError(f"{name} env is not supported in this Notebook")


Show value function 


def show_value_function_progress(env_desc, V, pi):  #@save
    # This function visualizes how value and policy changes over time.
    # V: [num_iters, num_states]
    # pi: [num_iters, num_states]
    # How to visualize value function is adapted (but changed) from:
    # https://sites.google.com/view/deep-rl-bootcamp/labs

    num_iters = V.shape[0]
    fig, ax = plt.subplots(figsize=(15, 15))

    for k in range(V.shape[0]):
        plt.subplot(4, 4, k + 1)
        plt.imshow(V[k].reshape(4,4), cmap="bone")
        ax = plt.gca()
        ax.set_xticks(np.arange(0, 5)-.5, minor=True)
        ax.set_yticks(np.arange(0, 5)-.5, minor=True)
        ax.grid(which="minor", color="w", linestyle='-', linewidth=3)
        ax.tick_params(which="minor", bottom=False, left=False)
        ax.set_xticks([])
        ax.set_yticks([])

        # LEFT action: 0, DOWN action: 1
        # RIGHT action: 2, UP action: 3
        action2dxdy = {0:(-.25, 0),1: (0, .25),
                       2:(0.25, 0),3: (-.25, 0)}

        for y in range(4):
            for x in range(4):
                action = pi[k].reshape(4,4)[y, x]
                dx, dy = action2dxdy[action]

                if env_desc[y,x].decode() == 'H':
                    ax.text(x, y, str(env_desc[y,x].decode()),
                            ha="center", va="center", color="y",
                            size=20, fontweight='bold')

                elif env_desc[y,x].decode() == 'G':
                    ax.text(x, y, str(env_desc[y,x].decode()),
                            ha="center", va="center", color="w",
                            size=20, fontweight='bold')

                else:
                    ax.text(x, y, str(env_desc[y,x].decode()),
                            ha="center", va="center", color="g",
                            size=15, fontweight='bold')

                # No arrow for cells with G and H labels
                if env_desc[y,x].decode() != 'G' and env_desc[y,x].decode() != 'H':
                    ax.arrow(x, y, dx, dy, color='r', head_width=0.2,
                             head_length=0.15)

        ax.set_title("Step = " + str(k + 1), fontsize=20)

    fig.tight_layout()
    plt.show()


Show Q function 


def show_Q_function_progress(env_desc, V_all, pi_all):  #@save
    # This function visualizes how value and policy changes over time.
    # V: [num_iters, num_states]
    # pi: [num_iters, num_states]

    # We want to only show a few values
    num_iters_all = V_all.shape[0]
    num_iters = num_iters_all // 10

    vis_indx = np.arange(0, num_iters_all, num_iters).tolist()
    vis_indx.append(num_iters_all - 1)
    V = np.zeros((len(vis_indx), V_all.shape[1]))
    pi = np.zeros((len(vis_indx), V_all.shape[1]))

    for c, i in enumerate(vis_indx):
        V[c] = V_all[i]
        pi[c] = pi_all[i]

    num_iters = V.shape[0]
    fig, ax = plt.subplots(figsize=(15, 15))

    for k in range(V.shape[0]):
        plt.subplot(4, 4, k + 1)
        plt.imshow(V[k].reshape(4,4), cmap="bone")
        ax = plt.gca()
        ax.set_xticks(np.arange(0, 5)-.5, minor=True)
        ax.set_yticks(np.arange(0, 5)-.5, minor=True)
        ax.grid(which="minor", color="w", linestyle='-', linewidth=3)
        ax.tick_params(which="minor", bottom=False, left=False)
        ax.set_xticks([])
        ax.set_yticks([])

        # LEFT action: 0, DOWN action: 1
        # RIGHT action: 2, UP action: 3
        action2dxdy = {0:(-.25, 0),1:(0, .25),
                       2:(0.25, 0),3:(-.25, 0)}

        for y in range(4):
            for x in range(4):
                action = pi[k].reshape(4,4)[y, x]
                dx, dy = action2dxdy[action]

                if env_desc[y,x].decode() == 'H':
                    ax.text(x, y, str(env_desc[y,x].decode()),
                            ha="center", va="center", color="y",
                            size=20, fontweight='bold')

                elif env_desc[y,x].decode() == 'G':
                    ax.text(x, y, str(env_desc[y,x].decode()),
                            ha="center", va="center", color="w",
                            size=20, fontweight='bold')

                else:
                    ax.text(x, y, str(env_desc[y,x].decode()),
                            ha="center", va="center", color="g",
                            size=15, fontweight='bold')

                # No arrow for cells with G and H labels
                if env_desc[y,x].decode() != 'G' and env_desc[y,x].decode() != 'H':
                    ax.arrow(x, y, dx, dy, color='r', head_width=0.2,
                             head_length=0.15)

        ax.set_title("Step = " + str(vis_indx[k] + 1), fontsize=20)

    fig.tight_layout()
    plt.show()
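The way show_Q_function_progress picks which iterations to display can be checked on its own. The sketch below reproduces the index selection with plain Python ranges. The helper name snapshot_indices and the guard against a zero step are our additions for illustration and are not part of the listing above.

```python
def snapshot_indices(num_iters_all, num_snapshots=10):
    """Pick roughly evenly spaced iteration indices plus the final one."""
    # Guard against a step of 0 when there are fewer than `num_snapshots`
    # iterations (an assumption added here; the listing uses `// 10` directly)
    step = max(num_iters_all // num_snapshots, 1)
    idx = list(range(0, num_iters_all, step))
    idx.append(num_iters_all - 1)  # Always show the last iteration
    return idx


indices = snapshot_indices(100)
print(indices)
```

For 100 iterations this yields eleven panels, which fits the 4 × 4 subplot grid. Iteration counts that are not multiples of ten can produce a few more panels, so it is worth checking that the result stays within sixteen.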


Trainer 


A bunch of functions that will be deprecated: 


def load_array(data_arrays, batch_size, is_train=True):  #@save
    """Construct a PyTorch data iterator."""
    dataset = torch.utils.data.TensorDataset(*data_arrays)
    return torch.utils.data.DataLoader(dataset, batch_size, shuffle=is_train)


def synthetic_data(w, b, num_examples):  #@save
    """Generate y = Xw + b + noise."""
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))


def sgd(params, lr, batch_size):  #@save
    """Minibatch stochastic gradient descent."""
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()


def get_dataloader_workers():  #@save
    """Use 4 processes to read the data."""
    return 4


def load_data_fashion_mnist(batch_size, resize=None):  #@save
    """Download the Fashion-MNIST dataset and then load it into memory."""
    trans = [transforms.ToTensor()]
    if resize:
        trans.insert(0, transforms.Resize(resize))
    trans = transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(
        root="../data", train=True, transform=trans, download=True)
    mnist_test = torchvision.datasets.FashionMNIST(
        root="../data", train=False, transform=trans, download=True)
    return (torch.utils.data.DataLoader(mnist_train, batch_size, shuffle=True,
                                        num_workers=get_dataloader_workers()),
            torch.utils.data.DataLoader(mnist_test, batch_size, shuffle=False,
                                        num_workers=get_dataloader_workers()))


def evaluate_accuracy_gpu(net, data_iter, device=None):  #@save
    """Compute the accuracy for a model on a dataset using a GPU."""
    if isinstance(net, nn.Module):
        net.eval()  # Set the model to evaluation mode
        if not device:
            device = next(iter(net.parameters())).device
    # No. of correct predictions, no. of predictions
    metric = d2l.Accumulator(2)
    with torch.no_grad():
        for X, y in data_iter:
            if isinstance(X, list):
                # Required for BERT Fine-tuning (to be covered later)
                X = [x.to(device) for x in X]
            else:
                X = X.to(device)
            y = y.to(device)
            metric.add(d2l.accuracy(net(X), y), y.numel())
    return metric[0] / metric[1]
#@save
def train_ch6(net, train_iter, test_iter, num_epochs, lr, device):
    """Train a model with a GPU (defined in Chapter 6)."""
    def init_weights(m):
        if type(m) == nn.Linear or type(m) == nn.Conv2d:
            nn.init.xavier_uniform_(m.weight)
    net.apply(init_weights)
    print('training on', device)
    net.to(device)
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=['train loss', 'train acc', 'test acc'])
    timer, num_batches = d2l.Timer(), len(train_iter)
    for epoch in range(num_epochs):
        # Sum of training loss, sum of training accuracy, no. of examples
        metric = d2l.Accumulator(3)
        net.train()
        for i, (X, y) in enumerate(train_iter):
            timer.start()
            optimizer.zero_grad()
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            l.backward()
            optimizer.step()
            with torch.no_grad():
                metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
            timer.stop()
            train_l = metric[0] / metric[2]
            train_acc = metric[1] / metric[2]
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (train_l, train_acc, None))
        test_acc = evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
          f'test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(device)}')

def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):  #@save
    """Plot a list of images."""
    figsize = (num_cols * scale, num_rows * scale)
    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
    axes = axes.flatten()
    for i, (ax, img) in enumerate(zip(axes, imgs)):
        try:
            img = img.detach().numpy()
        except:
            pass
        ax.imshow(img)
        ax.axes.get_xaxis().set_visible(False)
        ax.axes.get_yaxis().set_visible(False)
        if titles:
            ax.set_title(titles[i])
    return axes


def linreg(X, w, b):  #@save
    """The linear regression model."""
    return torch.matmul(X, w) + b


def squared_loss(y_hat, y):  #@save
    """Squared loss."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2


def get_fashion_mnist_labels(labels):  #@save
    """Return text labels for the Fashion-MNIST dataset."""
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]

class Animator:  #@save
    """For plotting data in animation."""
    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
                 ylim=None, xscale='linear', yscale='linear',
                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
                 figsize=(3.5, 2.5)):
        # Incrementally plot multiple lines
        if legend is None:
            legend = []
        d2l.use_svg_display()
        self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
        if nrows * ncols == 1:
            self.axes = [self.axes, ]
        # Use a lambda function to capture arguments
        self.config_axes = lambda: d2l.set_axes(
            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
        self.X, self.Y, self.fmts = None, None, fmts

    def add(self, x, y):
        # Add multiple data points into the figure
        if not hasattr(y, "__len__"):
            y = [y]
        n = len(y)
        if not hasattr(x, "__len__"):
            x = [x] * n
        if not self.X:
            self.X = [[] for _ in range(n)]
        if not self.Y:
            self.Y = [[] for _ in range(n)]
        for i, (a, b) in enumerate(zip(x, y)):
            if a is not None and b is not None:
                self.X[i].append(a)
                self.Y[i].append(b)
        self.axes[0].cla()
        for x, y, fmt in zip(self.X, self.Y, self.fmts):
            self.axes[0].plot(x, y, fmt)
        self.config_axes()
        display.display(self.fig)
        display.clear_output(wait=True)


class Accumulator:  #@save
    """For accumulating sums over `n` variables."""
    def __init__(self, n):
        self.data = [0.0] * n

    def add(self, *args):
        self.data = [a + float(b) for a, b in zip(self.data, args)]

    def reset(self):
        self.data = [0.0] * len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
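As a quick illustration of how such an accumulator is typically used, the sketch below keeps a running count of correct predictions and of examples over two toy batches. The class is re-declared locally so the snippet runs without the d2l package, and the batches are made-up data.

```python
class Accumulator:
    """Local copy of the accumulator pattern, for a self-contained demo."""

    def __init__(self, n):
        self.data = [0.0] * n

    def add(self, *args):
        self.data = [a + float(b) for a, b in zip(self.data, args)]

    def __getitem__(self, idx):
        return self.data[idx]


# Each batch is (predicted labels, true labels); the values are made up
batches = [([1, 0, 1], [1, 1, 1]), ([0, 0], [0, 1])]
metric = Accumulator(2)  # No. of correct predictions, no. of predictions
for preds, labels in batches:
    correct = sum(p == t for p, t in zip(preds, labels))
    metric.add(correct, len(labels))
acc = metric[0] / metric[1]
print(acc)
```

Dividing the two running sums at the end gives the accuracy over all examples, regardless of how the batches were sized.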


def accuracy(y_hat, y):  #@save
    """Compute the number of correct predictions."""
    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
        y_hat = y_hat.argmax(axis=1)
    cmp = y_hat.type(y.dtype) == y
    return float(cmp.type(y.dtype).sum())


import hashlib
import os
import tarfile
import zipfile
import requests


def download(url, folder='../data', sha1_hash=None):  #@save
    """Download a file to folder and return the local filepath."""
    if not url.startswith('http'):
        # For back compatibility
        url, sha1_hash = DATA_HUB[url]
    os.makedirs(folder, exist_ok=True)
    fname = os.path.join(folder, url.split('/')[-1])
    # Check if hit cache
    if os.path.exists(fname) and sha1_hash:
        sha1 = hashlib.sha1()
        with open(fname, 'rb') as f:
            while True:
                data = f.read(1048576)
                if not data:
                    break
                sha1.update(data)
        if sha1.hexdigest() == sha1_hash:
            return fname
    # Download
    print(f'Downloading {fname} from {url}...')
    r = requests.get(url, stream=True, verify=True)
    with open(fname, 'wb') as f:
        f.write(r.content)
    return fname
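The cache check inside download hashes the file in 1 MB chunks so that large archives never need to fit in memory at once. The sketch below isolates that step using only the standard library. The helper names sha1_of_file and is_cached, the temporary file, and its contents are hypothetical and exist only for this demonstration.

```python
import hashlib
import os
import tempfile


def sha1_of_file(fname, chunk_size=1048576):
    """Compute the SHA-1 hex digest of a file, reading it chunk by chunk."""
    sha1 = hashlib.sha1()
    with open(fname, 'rb') as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            sha1.update(data)
    return sha1.hexdigest()


def is_cached(fname, sha1_hash):
    """Return True if the file exists and matches the expected hash."""
    return os.path.exists(fname) and sha1_of_file(fname) == sha1_hash


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.bin')
    payload = b'd2l' * 1000
    with open(path, 'wb') as f:
        f.write(payload)
    expected = hashlib.sha1(payload).hexdigest()
    hit = is_cached(path, expected)    # Matching hash: no download needed
    miss = is_cached(path, '0' * 40)   # Wrong hash: would trigger a download
    # A tiny chunk size exercises the loop and must give the same digest
    chunked_ok = sha1_of_file(path, chunk_size=64) == expected
print(hit, miss, chunked_ok)
```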


def extract(filename, folder=None):  #@save
    """Extract a zip/tar file into folder."""
    base_dir = os.path.dirname(filename)
    _, ext = os.path.splitext(filename)
    assert ext in ('.zip', '.tar', '.gz'), 'Only support zip/tar files.'
    if ext == '.zip':
        fp = zipfile.ZipFile(filename, 'r')
    else:
        fp = tarfile.open(filename, 'r')
    if folder is None:
        folder = base_dir
    fp.extractall(folder)


def download_extract(name, folder=None):  #@save
    """Download and extract a zip/tar file."""
    fname = download(name)
    base_dir = os.path.dirname(fname)
    data_dir, ext = os.path.splitext(fname)
    if ext == '.zip':
        fp = zipfile.ZipFile(fname, 'r')
    elif ext in ('.tar', '.gz'):
        fp = tarfile.open(fname, 'r')
    else:
        assert False, 'Only zip/tar files can be extracted.'
    fp.extractall(base_dir)
    return os.path.join(base_dir, folder) if folder else data_dir


def tokenize(lines, token='word'):  #@save
    """Split text lines into word or character tokens."""
    assert token in ('word', 'char'), 'Unknown token type: ' + token
    return [line.split() if token == 'word' else list(line) for line in lines]


def evaluate_loss(net, data_iter, loss):  #@save
    """Evaluate the loss of a model on the given dataset."""
    metric = d2l.Accumulator(2)  # Sum of losses, no. of examples
    for X, y in data_iter:
        out = net(X)
        y = y.reshape(out.shape)
        l = loss(out, y)
        metric.add(l.sum(), l.numel())
    return metric[0] / metric[1]


def grad_clipping(net, theta):  #@save
    """Clip the gradient."""
    if isinstance(net, nn.Module):
        params = [p for p in net.parameters() if p.requires_grad]
    else:
        params = net.params
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm

More for the attention chapter. 


#@save
d2l.DATA_HUB['fra-eng'] = (d2l.DATA_URL + 'fra-eng.zip',
                           '94646ad1522d915e7b0f9296181140edcf86a4f5')


#@save
def read_data_nmt():
    """Load the English-French dataset."""
    data_dir = d2l.download_extract('fra-eng')
    with open(os.path.join(data_dir, 'fra.txt'), 'r', encoding='utf-8') as f:
        return f.read()


#@save
def preprocess_nmt(text):
    """Preprocess the English-French dataset."""
    def no_space(char, prev_char):
        return char in set(',.!?') and prev_char != ' '

    # Replace non-breaking space with space, and convert uppercase letters to
    # lowercase ones
    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
    # Insert space between words and punctuation marks
    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
           for i, char in enumerate(text)]
    return ''.join(out)
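To see what this preprocessing does to a single line, the sketch below repeats the same three steps (normalizing non-breaking spaces, lowercasing, and inserting a space before punctuation that directly follows a word) on a made-up English-French pair. The function is a local copy named preprocess so that the snippet runs on its own.

```python
def preprocess(text):
    """Local copy of the preprocessing steps, for a self-contained demo."""
    def no_space(char, prev_char):
        return char in set(',.!?') and prev_char != ' '

    # Non-breaking spaces become ordinary spaces, then everything is lowercased
    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
    # Insert a space before punctuation that directly follows another character
    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
           for i, char in enumerate(text)]
    return ''.join(out)


sample = 'Who?\tQui\u202f?'  # Made-up pair separated by a tab
result = preprocess(sample)
print(repr(result))
```

The English question mark gains a leading space, while the French one already had a (non-breaking) space before it and is left alone, so both sides end up tokenizable by splitting on spaces.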


#@save
def tokenize_nmt(text, num_examples=None):
    """Tokenize the English-French dataset."""
    source, target = [], []
    for i, line in enumerate(text.split('\n')):
        if num_examples and i > num_examples:
            break
        parts = line.split('\t')
        if len(parts) == 2:
            source.append(parts[0].split(' '))
            target.append(parts[1].split(' '))
    return source, target


#@save
def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad sequences."""
    if len(line) > num_steps:
        return line[:num_steps]  # Truncate
    return line + [padding_token] * (num_steps - len(line))  # Pad


#@save
def build_array_nmt(lines, vocab, num_steps):
    """Transform text sequences of machine translation into minibatches."""
    lines = [vocab[l] for l in lines]
    lines = [l + [vocab['<eos>']] for l in lines]
    array = torch.tensor([truncate_pad(
        l, num_steps, vocab['<pad>']) for l in lines])
    valid_len = (array != vocab['<pad>']).type(torch.int32).sum(1)
    return array, valid_len
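The interplay between appending the end-of-sequence token, truncating or padding, and counting valid lengths is easiest to follow on plain lists. In the sketch below the token ids PAD = 0 and EOS = 1 and the two toy sequences are hypothetical, and lists stand in for the tensors used above.

```python
def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad a sequence to exactly num_steps tokens."""
    if len(line) > num_steps:
        return line[:num_steps]  # Truncate
    return line + [padding_token] * (num_steps - len(line))  # Pad


PAD, EOS = 0, 1  # Hypothetical token ids
lines = [[5, 6], [7, 8, 9, 10, 11]]
lines = [l + [EOS] for l in lines]                 # Append <eos> first
array = [truncate_pad(l, 4, PAD) for l in lines]   # Then fix the length
valid_len = [sum(tok != PAD for tok in row) for row in array]
print(array, valid_len)
```

Note that the second sequence is longer than num_steps, so its end-of-sequence token is cut off by the truncation.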


#@save
def load_data_nmt(batch_size, num_steps, num_examples=600):
    """Return the iterator and the vocabularies of the translation dataset."""
    text = preprocess_nmt(read_data_nmt())
    source, target = tokenize_nmt(text, num_examples)
    src_vocab = d2l.Vocab(source, min_freq=2,
                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
    tgt_vocab = d2l.Vocab(target, min_freq=2,
                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
    src_array, src_valid_len = build_array_nmt(source, src_vocab, num_steps)
    tgt_array, tgt_valid_len = build_array_nmt(target, tgt_vocab, num_steps)
    data_arrays = (src_array, src_valid_len, tgt_array, tgt_valid_len)
    data_iter = d2l.load_array(data_arrays, batch_size)
    return data_iter, src_vocab, tgt_vocab


#@save
def sequence_mask(X, valid_len, value=0):
    """Mask irrelevant entries in sequences."""
    maxlen = X.size(1)
    mask = torch.arange((maxlen), dtype=torch.float32,
                        device=X.device)[None, :] < valid_len[:, None]
    X[~mask] = value
    return X
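The masking rule compares each position index with the valid length of its row and overwrites everything at or beyond that length. A minimal list-based sketch of the same comparison follows. It returns a new nested list instead of modifying a tensor in place, and the inputs are toy values.

```python
def sequence_mask(X, valid_len, value=0):
    """Replace entries at positions >= valid_len[i] in row i with `value`."""
    return [[x if j < n else value for j, x in enumerate(row)]
            for row, n in zip(X, valid_len)]


X = [[1, 2, 3], [4, 5, 6]]
masked = sequence_mask(X, [1, 2])
masked_neg = sequence_mask(X, [1, 2], value=-1)
print(masked, masked_neg)
```

Applied to a tensor of ones, the same rule yields the 0/1 weights that a masked loss can multiply with the per-token losses.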


#@save
class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
    """The softmax cross-entropy loss with masks."""
    # `pred` shape: (`batch_size`, `num_steps`, `vocab_size`)
    # `label` shape: (`batch_size`, `num_steps`)
    # `valid_len` shape: (`batch_size`,)
    def forward(self, pred, label, valid_len):
        weights = torch.ones_like(label)
        weights = sequence_mask(weights, valid_len)
        self.reduction='none'
        unweighted_loss = super(MaskedSoftmaxCELoss, self).forward(
            pred.permute(0, 2, 1), label)
        weighted_loss = (unweighted_loss * weights).mean(dim=1)
        return weighted_loss


#@save
def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
    """Train a model for sequence to sequence."""
    def xavier_init_weights(m):
        if type(m) == nn.Linear:
            nn.init.xavier_uniform_(m.weight)
        if type(m) == nn.GRU:
            for param in m._flat_weights_names:
                if "weight" in param:
                    nn.init.xavier_uniform_(m._parameters[param])
    net.apply(xavier_init_weights)
    net.to(device)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    loss = MaskedSoftmaxCELoss()
    net.train()
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[10, num_epochs])
    for epoch in range(num_epochs):
        timer = d2l.Timer()
        metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
        for batch in data_iter:
            optimizer.zero_grad()
            X, X_valid_len, Y, Y_valid_len = [x.to(device) for x in batch]
            bos = torch.tensor([tgt_vocab['<bos>']] * Y.shape[0],
                               device=device).reshape(-1, 1)
            dec_input = torch.cat([bos, Y[:, :-1]], 1)  # Teacher forcing
            Y_hat, _ = net(X, dec_input, X_valid_len)
            l = loss(Y_hat, Y, Y_valid_len)
            l.sum().backward()  # Make the loss scalar for `backward`
            d2l.grad_clipping(net, 1)
            num_tokens = Y_valid_len.sum()
            optimizer.step()
            with torch.no_grad():
                metric.add(l.sum(), num_tokens)
        if (epoch + 1) % 10 == 0:
            animator.add(epoch + 1, (metric[0] / metric[1],))
    print(f'loss {metric[0] / metric[1]:.3f}, {metric[1] / timer.stop():.1f} '
          f'tokens/sec on {str(device)}')


#@save
def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
                    device, save_attention_weights=False):
    """Predict for sequence to sequence."""
    # Set `net` to eval mode for inference
    net.eval()
    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [
        src_vocab['<eos>']]
    enc_valid_len = torch.tensor([len(src_tokens)], device=device)
    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])
    # Add the batch axis
    enc_X = torch.unsqueeze(
        torch.tensor(src_tokens, dtype=torch.long, device=device), dim=0)
    enc_outputs = net.encoder(enc_X, enc_valid_len)
    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
    # Add the batch axis
    dec_X = torch.unsqueeze(torch.tensor(
        [tgt_vocab['<bos>']], dtype=torch.long, device=device), dim=0)
    output_seq, attention_weight_seq = [], []
    for _ in range(num_steps):
        Y, dec_state = net.decoder(dec_X, dec_state)
        # We use the token with the highest prediction likelihood as input
        # of the decoder at the next time step
        dec_X = Y.argmax(dim=2)
        pred = dec_X.squeeze(dim=0).type(torch.int32).item()
        # Save attention weights (to be covered later)
        if save_attention_weights:
            attention_weight_seq.append(net.decoder.attention_weights)
        # Once the end-of-sequence token is predicted, the generation of the
        # output sequence is complete
        if pred == tgt_vocab['<eos>']:
            break
        output_seq.append(pred)
    return ' '.join(tgt_vocab.to_tokens(output_seq)), attention_weight_seq


B.8 The d2l API Document


This section displays classes and functions (sorted alphabetically) in the d2l package,
showing where they are defined in the book so you can find more detailed implementations
and explanations. See also the source code on the GitHub repository.


B.8.1 Classes 


class d2l.torch.AdditiveAttention(num_hiddens, dropout, **kwargs)


Bases: Module 
Additive attention. 
Defined in Section 11.3.2 


forward (queries, keys, values, valid_lens) 


Defines the computation performed at every call. 


Should be overridden by all subclasses. 


Note: Although the recipe for forward pass needs to be defined within this function, 
one should call the Module (page 1078) instance afterwards instead of this since the 
former takes care of running the registered hooks while the latter silently ignores them. 


class d2l.torch.AddNorm(norm_shape, dropout)


Bases: Module

The residual connection followed by layer normalization.
Defined in Section 11.7.2

forward(X, Y)


Defines the computation performed at every call. 


Should be overridden by all subclasses. 


Note: Although the recipe for forward pass needs to be defined within this function, 
one should call the Module (page 1078) instance afterwards instead of this since the 
former takes care of running the registered hooks while the latter silently ignores them. 


class d2l.torch.AttentionDecoder


Bases: Decoder (page 1075) 


The base attention-based decoder interface. 
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Defined in Section 11.4 
property attention_weights 
class d2l.torch.Classifier(plot_train_per_epoch=2, plot_valid_per_epoch=1)
Bases: Module (page 1078) 
The base class of classification models. 
Defined in Section 4.3 


accuracy (Y_hat, Y, averaged=True) 


Compute the number of correct predictions. 
Defined in Section 4.3 


layer_summary (X_shape) 
Defined in Section 7.6 


loss(Y_hat, Y, averaged=True) 
Defined in Section 4.5 


validation_step (batch) 
class d2l.torch.DataModule(root='../data', num_workers=4)
Bases: HyperParameters (page 1077)
The base class of data.
Defined in Section 3.2.2
get_dataloader(train)
get_tensorloader(tensors, train, indices=slice(0, None, None))
Defined in Section 3.3
train_dataloader()
val_dataloader()
class d2l.torch.Decoder
Bases: Module 
The base decoder interface for the encoder—decoder architecture. 
Defined in Section 10.6 


forward(X, state) 


Defines the computation performed at every call. 


Should be overridden by all subclasses. 
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Note: Although the recipe for forward pass needs to be defined within this function, 
one should call the Module (page 1078) instance afterwards instead of this since the 
former takes care of running the registered hooks while the latter silently ignores them. 


init_state(enc_all_outputs, *args) 
class d2l.torch.DotProductAttention(dropout)
Bases: Module 
Scaled dot product attention. 
Defined in Section 11.3.2 


forward (queries, keys, values, valid_lens=None) 


Defines the computation performed at every call. 


Should be overridden by all subclasses. 


Note: Although the recipe for forward pass needs to be defined within this function, 
one should call the Module (page 1078) instance afterwards instead of this since the 
former takes care of running the registered hooks while the latter silently ignores them. 


class d2l.torch.Encoder


Bases: Module 
The base encoder interface for the encoder—decoder architecture. 
Defined in Section 10.6 


forward(X, *args) 


Defines the computation performed at every call. 


Should be overridden by all subclasses. 


Note: Although the recipe for forward pass needs to be defined within this function, 
one should call the Module (page 1078) instance afterwards instead of this since the 
former takes care of running the registered hooks while the latter silently ignores them. 


class d2l.torch.EncoderDecoder(encoder, decoder)


Bases: Classifier (page 1075) 
The base class for the encoder-decoder architecture.


Defined in Section 10.6 
forward(enc_X, dec_X, *args)


Defines the computation performed at every call. 


Should be overridden by all subclasses. 


Note: Although the recipe for forward pass needs to be defined within this function, 
one should call the Module (page 1078) instance afterwards instead of this since the 
former takes care of running the registered hooks while the latter silently ignores them. 


predict_step(batch, device, num_steps, save_attention_weights=False)


Defined in Section 10.7.6 


class d2l.torch.FashionMNIST(batch_size=64, resize=(28, 28))
Bases: DataModule (page 1075) 


The Fashion-MNIST dataset. 
Defined in Section 4.2 


get_dataloader(train)
Defined in Section 4.2 


text_labels(indices)


Return text labels. 
Defined in Section 4.2 


visualize(batch, nrows=1, ncols=8, labels=[]) 


Defined in Section 4.2 


class d2l.torch.GRU(num_inputs, num_hiddens, num_layers, dropout=0)
Bases: RNN (page 1081) 


The multilayer GRU model. 
Defined in Section 10.3 


class d2l.torch.HyperParameters


Bases: object 
The base class of hyperparameters. 


save_hyperparameters(ignore=[])


Save function arguments into class attributes. 


Defined in Section B.7 
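
One way to "save function arguments into class attributes" is to inspect the caller's stack frame. The following is a minimal sketch under that assumption; the class names HyperParametersSketch and Demo are hypothetical, and the library's own implementation may differ.

```python
import inspect


class HyperParametersSketch:
    """Hypothetical stand-in illustrating the idea behind save_hyperparameters."""

    def save_hyperparameters(self, ignore=()):
        # Look at the caller's frame (normally __init__) and read its arguments.
        frame = inspect.currentframe().f_back
        _, _, _, local_vars = inspect.getargvalues(frame)
        skip = set(ignore) | {"self"}
        # Keep every local that is neither ignored nor private.
        self.hparams = {k: v for k, v in local_vars.items()
                        if k not in skip and not k.startswith("_")}
        # Expose each saved argument as an attribute of the instance.
        for k, v in self.hparams.items():
            setattr(self, k, v)


class Demo(HyperParametersSketch):
    def __init__(self, a, b=2, c=3):
        self.save_hyperparameters(ignore=["c"])


demo = Demo(1)
print(demo.a, demo.b, hasattr(demo, "c"))  # 1 2 False
```

Calling save_hyperparameters first thing in __init__ keeps constructors short, because no explicit self.a = a assignments are needed.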
class d2l.torch.LeNet(lr=0.1, num_classes=10)
Bases: Classifier (page 1075) 


The LeNet-5 model. 
Defined in Section 7.6 


class d2l.torch.LinearRegression(lr)
Bases: Module (page 1078) 


The linear regression model implemented with high-level APIs. 
Defined in Section 3.5 


configure_optimizers() 


Defined in Section 3.5 


forward(X)
Defined in Section 3.5 


get_w_b() 
Defined in Section 3.5 


loss(y_hat, y)
Defined in Section 3.5 


class d2l.torch.LinearRegressionScratch(num_inputs, lr, sigma=0.01)
Bases: Module (page 1078) 


The linear regression model implemented from scratch. 
Defined in Section 3.4 


configure_optimizers() 


Defined in Section 3.4 


forward(X)
Defined in Section 3.4 


loss(y_hat, y)
Defined in Section 3.4 


class d2l.torch.Module(plot_train_per_epoch=2, plot_valid_per_epoch=1)
Bases: Module, HyperParameters (page 1077) 


The base class of models. 
Defined in Section 3.2 


apply_init(inputs, init=None)
Defined in Section 6.4 
configure_optimizers()


Defined in Section 4.3 


forward(X)


Defines the computation performed at every call. 


Should be overridden by all subclasses. 


Note: Although the recipe for forward pass needs to be defined within this function, 
one should call the Module (page 1078) instance afterwards instead of this since the 
former takes care of running the registered hooks while the latter silently ignores them. 


loss(y_hat, y)
plot(key, value, train)
Plot a point in animation. 


training_step(batch)
validation_step(batch)


class d2l.torch.MTFraEng(batch_size, num_steps=9, num_train=512,
num_val=128)


Bases: DataModule (page 1075) 
The English-French dataset. 
Defined in Section 10.5 


build(src_sentences, tgt_sentences) 


Defined in Section 10.5.3 


get_dataloader(train)
Defined in Section 10.5.3 


class d2l.torch.MultiHeadAttention(num_hiddens, num_heads, dropout,
bias=False, **kwargs)


Bases: Module (page 1078) 
Multi-head attention. 
Defined in Section 11.5 


forward(queries, keys, values, valid_lens)


Defines the computation performed at every call. 


Should be overridden by all subclasses. 
Note: Although the recipe for forward pass needs to be defined within this function,
one should call the Module (page 1078) instance afterwards instead of this since the 
former takes care of running the registered hooks while the latter silently ignores them. 


transpose_output(X)


Reverse the operation of transpose_qkv. 
Defined in Section 11.5 


transpose_qkv(X) 


Transposition for parallel computation of multiple attention heads. 


Defined in Section 11.5 


class d2l.torch.PositionalEncoding(num_hiddens, dropout, max_len=1000)


Bases: Module 
Positional encoding. 
Defined in Section 11.6 
forward(X) 


Defines the computation performed at every call. 


Should be overridden by all subclasses. 


Note: Although the recipe for forward pass needs to be defined within this function, 
one should call the Module (page 1078) instance afterwards instead of this since the 
former takes care of running the registered hooks while the latter silently ignores them. 
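
To make the positional-encoding entry more concrete, here is a small sketch that builds a table of sinusoidal encodings with plain Python lists. It assumes the standard sinusoidal scheme (sine in even columns, cosine in odd columns, wavelengths growing with the column index); the function name positional_encoding_sketch is hypothetical, and the sketch omits the dropout and the addition to the input that a forward pass would perform.

```python
import math


def positional_encoding_sketch(max_len, num_hiddens):
    """Return a max_len x num_hiddens table of sinusoidal encodings."""
    table = [[0.0] * num_hiddens for _ in range(max_len)]
    for pos in range(max_len):
        for j in range(0, num_hiddens, 2):
            # The frequency decreases as the (even) column index j grows.
            angle = pos / (10000 ** (j / num_hiddens))
            table[pos][j] = math.sin(angle)          # even columns use sine
            if j + 1 < num_hiddens:
                table[pos][j + 1] = math.cos(angle)  # odd columns use cosine
    return table


pe = positional_encoding_sketch(max_len=3, num_hiddens=4)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 in alternating columns
```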


class d2l.torch.PositionWiseFFN(ffn_num_hiddens, ffn_num_outputs)


Bases: Module 

The positionwise feed-forward network. 
Defined in Section 11.7 

forward(X) 


Defines the computation performed at every call. 


Should be overridden by all subclasses. 


Note: Although the recipe for forward pass needs to be defined within this function, 
one should call the Module (page 1078) instance afterwards instead of this since the 
former takes care of running the registered hooks while the latter silently ignores them. 
class d2l.torch.ProgressBoard(xlabel=None, ylabel=None, xlim=None,
ylim=None, xscale='linear', yscale='linear',
ls=['-', '--', '-.', ':'], colors=['C0', 'C1', 'C2',
'C3'], fig=None, axes=None, figsize=(3.5, 2.5),
display=True)


Bases: HyperParameters (page 1077) 
The board that plots data points in animation. 
Defined in Section 3.2 


draw(x, y, label, every_n=1) 
Defined in Section B.7 


class d2l.torch.Residual(num_channels, use_1x1conv=False, strides=1)


Bases: Module 
The Residual block of ResNet models. 
Defined in Section 8.6 


forward(X) 


Defines the computation performed at every call. 


Should be overridden by all subclasses. 


Note: Although the recipe for forward pass needs to be defined within this function, 
one should call the Module (page 1078) instance afterwards instead of this since the 
former takes care of running the registered hooks while the latter silently ignores them. 


class d2l.torch.ResNeXtBlock(num_channels, groups, bot_mul,
use_1x1conv=False, strides=1)


Bases: Module 

The ResNeXt block. 
Defined in Section 8.6.2 
forward(X)


Defines the computation performed at every call. 


Should be overridden by all subclasses. 


Note: Although the recipe for forward pass needs to be defined within this function, 
one should call the Module (page 1078) instance afterwards instead of this since the 
former takes care of running the registered hooks while the latter silently ignores them. 
class d2l.torch.RNN(num_inputs, num_hiddens)
Bases: Module (page 1078) 


The RNN model implemented with high-level APIs. 
Defined in Section 9.6 


forward(inputs, H=None)


Defines the computation performed at every call. 


Should be overridden by all subclasses. 


Note: Although the recipe for forward pass needs to be defined within this function, 
one should call the Module (page 1078) instance afterwards instead of this since the 
former takes care of running the registered hooks while the latter silently ignores them. 


class d2l.torch.RNNLM(rnn, vocab_size, lr=0.01)
Bases: RNNLMScratch (page 1082) 


The RNN-based language model implemented with high-level APIs. 
Defined in Section 9.6 
init_params() 
output_layer(hiddens)
Defined in Section 9.5 


class d2l.torch.RNNLMScratch(rnn, vocab_size, lr=0.01)
Bases: Classifier (page 1075) 


The RNN-based language model implemented from scratch. 
Defined in Section 9.5 


forward(X, state=None) 
Defined in Section 9.5 


init_params() 
one_hot(X) 
Defined in Section 9.5 


output_layer(rnn_outputs)
Defined in Section 9.5 


predict(prefix, num_preds, vocab, device=None)


Defined in Section 9.5 
training_step(batch)
validation_step(batch)
class d2l.torch.RNNScratch(num_inputs, num_hiddens, sigma=0.01)
Bases: Module (page 1078) 
The RNN model implemented from scratch. 
Defined in Section 9.5 


forward(inputs, state=None)
Defined in Section 9.5 


class d2l.torch.Seq2Seq(encoder, decoder, tgt_pad, lr)
Bases: EncoderDecoder (page 1076) 


The RNN encoder-decoder for sequence-to-sequence learning.
Defined in Section 10.7.3 


configure_optimizers() 
Defined in Section 4.3 


validation_step(batch)


class d2l.torch.Seq2SeqEncoder(vocab_size, embed_size, num_hiddens,
num_layers, dropout=0)


Bases: Encoder (page 1076) 
The RNN encoder for sequence-to-sequence learning. 
Defined in Section 10.7 


forward(X, *args) 


Defines the computation performed at every call. 


Should be overridden by all subclasses. 


Note: Although the recipe for forward pass needs to be defined within this function, 
one should call the Module (page 1078) instance afterwards instead of this since the 
former takes care of running the registered hooks while the latter silently ignores them. 


class d2l.torch.SGD(params, lr)
Bases: HyperParameters (page 1077) 


Minibatch stochastic gradient descent. 


Defined in Section 3.4 
step()
zero_grad()
class d2l.torch.SoftmaxRegression(num_outputs, lr)
Bases: Classifier (page 1075) 
The softmax regression model. 
Defined in Section 4.5 


forward(X)


Defines the computation performed at every call. 


Should be overridden by all subclasses. 


Note: Although the recipe for forward pass needs to be defined within this function, 
one should call the Module (page 1078) instance afterwards instead of this since the 
former takes care of running the registered hooks while the latter silently ignores them. 


class d2l.torch.SyntheticRegressionData(w, b, noise=0.01, num_train=1000,
num_val=1000, batch_size=32)


Bases: DataModule (page 1075) 
Synthetic data for linear regression. 
Defined in Section 3.3 


get_dataloader(train)
Defined in Section 3.3 


class d2l.torch.TimeMachine(batch_size, num_steps, num_train=10000,
num_val=5000)


Bases: DataModule (page 1075) 
The Time Machine dataset. 
Defined in Section 9.2 


build(raw_text, vocab=None) 
Defined in Section 9.2 


get_dataloader(train)
Defined in Section 9.3.3 


class d2l.torch.Trainer(max_epochs, num_gpus=0, gradient_clip_val=0)


Bases: HyperParameters (page 1077) 


The base class for training models with data. 
Defined in Section 3.2.2


clip_gradients(grad_clip_val, model) 
Defined in Section 9.5 
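
Gradient clipping by global norm can be sketched without any tensor library. The example below assumes the common rule "if the L2 norm of all gradients exceeds the threshold, rescale them so the norm equals the threshold"; the function name clip_gradients_sketch and the list-of-lists representation of gradients are hypothetical stand-ins for model parameters.

```python
import math


def clip_gradients_sketch(grad_clip_val, grads):
    """Rescale grads in place so their global L2 norm is at most grad_clip_val.

    grads: list of flat lists of floats, one list per parameter.
    Returns the norm measured before clipping.
    """
    # Global norm over every gradient entry of every parameter.
    norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if norm > grad_clip_val:
        scale = grad_clip_val / norm
        for grad in grads:
            grad[:] = [g * scale for g in grad]  # in-place update
    return norm


grads = [[3.0], [4.0]]                  # global norm is 5
clip_gradients_sketch(1.0, grads)
print(grads)                            # rescaled so the global norm is about 1
```

Because all gradients share one scale factor, the update direction is preserved and only its length changes.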


fit(model, data)
fit_epoch() 
Defined in Section 3.4 


prepare_batch(batch) 
Defined in Section 6.7 


prepare_data(data)
prepare_model(model)
Defined in Section 6.7 


class d2l.torch.TransformerEncoder(vocab_size, num_hiddens, ffn_num_hiddens,
num_heads, num_blks, dropout,
use_bias=False)


Bases: Encoder (page 1076) 
The Transformer encoder. 
Defined in Section 11.7.4 


forward(X, valid_lens) 


Defines the computation performed at every call. 


Should be overridden by all subclasses. 


Note: Although the recipe for forward pass needs to be defined within this function, 
one should call the Module (page 1078) instance afterwards instead of this since the 
former takes care of running the registered hooks while the latter silently ignores them. 


class d2l.torch.TransformerEncoderBlock(num_hiddens, ffn_num_hiddens,
num_heads, dropout, use_bias=False)


Bases: Module 
The Transformer encoder block. 
Defined in Section 11.7.2 


forward(X, valid_lens) 


Defines the computation performed at every call. 


Should be overridden by all subclasses. 
Note: Although the recipe for forward pass needs to be defined within this function,
one should call the Module (page 1078) instance afterwards instead of this since the 
former takes care of running the registered hooks while the latter silently ignores them. 


class d2l.torch.Vocab(tokens=[], min_freq=0, reserved_tokens=[])


Bases: object 
Vocabulary for text. 


to_tokens(indices)


property unk 
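
The vocabulary entry above lists only to_tokens and unk, so a small sketch may help show how such a class could map between tokens and indices. Everything here is an assumption for illustration: the class name VocabSketch, the '<unk>' token string, the sorted index order, and the __getitem__ lookup are not taken from the listing and may differ from the library.

```python
import collections


class VocabSketch:
    """Hypothetical minimal vocabulary mapping tokens to integer indices."""

    def __init__(self, tokens=(), min_freq=0, reserved_tokens=()):
        counter = collections.Counter(tokens)
        # Drop tokens that occur fewer than min_freq times.
        kept = [tok for tok, freq in counter.items() if freq >= min_freq]
        # Indices follow the sorted order of '<unk>', reserved and kept tokens.
        self.idx_to_token = sorted(set(["<unk>", *reserved_tokens, *kept]))
        self.token_to_idx = {tok: i for i, tok in enumerate(self.idx_to_token)}

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        # A single token maps to one index; unknown tokens map to unk.
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self[tok] for tok in tokens]

    def to_tokens(self, indices):
        # Inverse lookup: one index or a list of indices back to tokens.
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[i] for i in indices]

    @property
    def unk(self):
        return self.token_to_idx["<unk>"]


vocab = VocabSketch("the cat sat on the mat".split(), min_freq=1)
print(vocab.to_tokens(vocab[["the", "mat"]]))  # ['the', 'mat']
```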


B.8.2 Functions 


d2l.torch.add_to_class(Class)


Register functions as methods in created class. 
Defined in Section 3.2 
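
Registering a function as a method of an existing class can be done with a small decorator built on setattr. The sketch below is one plausible way to do it; the name add_to_class_sketch and the demo class A are hypothetical.

```python
def add_to_class_sketch(Class):
    """Return a decorator that attaches the decorated function to Class."""
    def wrapper(obj):
        # The function becomes an attribute of the class under its own name.
        setattr(Class, obj.__name__, obj)
    return wrapper


class A:
    def __init__(self):
        self.b = 1


a = A()  # instance created before the method exists


@add_to_class_sketch(A)
def do(self):
    return self.b + 1


print(a.do())  # 2
```

Because the method is attached to the class rather than to an instance, objects created earlier (such as a above) also gain it. This is what makes it possible to split a long class definition across several notebook cells.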


d2l.torch.bleu(pred_seq, label_seq, k)
Compute the BLEU. 


Defined in Section 10.7.6 
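
For readers who want to see the arithmetic behind BLEU, here is a minimal sketch. It assumes whitespace-separated token strings, a brevity penalty of exp(min(0, 1 - len_label / len_pred)), and n-gram precisions weighted by (1/2)^n for n up to k; the function name bleu_sketch is hypothetical and the library function may differ in detail.

```python
import collections
import math


def bleu_sketch(pred_seq, label_seq, k):
    """BLEU for one prediction/label pair given as space-separated strings."""
    pred_tokens, label_tokens = pred_seq.split(" "), label_seq.split(" ")
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    # Brevity penalty: predictions shorter than the label are penalized.
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, min(k, len_pred) + 1):
        num_matches = 0
        label_subs = collections.defaultdict(int)
        # Count every n-gram in the label.
        for i in range(len_label - n + 1):
            label_subs[" ".join(label_tokens[i:i + n])] += 1
        # Each label n-gram can be matched at most as often as it occurs.
        for i in range(len_pred - n + 1):
            ngram = " ".join(pred_tokens[i:i + n])
            if label_subs[ngram] > 0:
                num_matches += 1
                label_subs[ngram] -= 1
        precision = num_matches / (len_pred - n + 1)
        score *= math.pow(precision, math.pow(0.5, n))
    return score


print(bleu_sketch("a b c", "a b c", k=2))  # a perfect match scores 1.0
```

Longer n-grams receive smaller exponents, so a given precision below 1 on them lowers the score less than the same precision on unigrams would.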


d2l.torch.check_len(a, n)
Check the length of a list. 


Defined in Section 9.5 


d2l.torch.check_shape(a, shape)
Check the shape of a tensor. 


Defined in Section 9.5 


d2l.torch.corr2d(X, K)


Compute 2D cross-correlation. 
Defined in Section 7.2 
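
Two-dimensional cross-correlation slides the kernel over the input and sums the elementwise products at each position. The sketch below uses nested Python lists instead of tensors; the name corr2d_sketch is hypothetical.

```python
def corr2d_sketch(X, K):
    """Cross-correlate 2D input X with 2D kernel K (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    out_h, out_w = len(X) - h + 1, len(X[0]) - w + 1
    Y = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Sum of elementwise products between K and the window at (i, j).
            Y[i][j] = sum(X[i + a][j + b] * K[a][b]
                          for a in range(h) for b in range(w))
    return Y


X = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
K = [[0.0, 1.0], [2.0, 3.0]]
print(corr2d_sketch(X, K))  # [[19.0, 25.0], [37.0, 43.0]]
```

The output has shape (rows of X - rows of K + 1) by (columns of X - columns of K + 1), here 2 by 2.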


d2l.torch.cpu()
Get the CPU device. 


Defined in Section 6.7 


d2l.torch.gpu(i=0)
Get a GPU device. 


Defined in Section 6.7 
d2l.torch.init_cnn(module)


Initialize weights for CNNs. 
Defined in Section 7.6 
d2l.torch.init_seq2seq(module)
Initialize weights for sequence-to-sequence learning. 
Defined in Section 10.7 
d2l.torch.masked_softmax(X, valid_lens)
Perform softmax operation by masking elements on the last axis. 
Defined in Section 11.3 
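
Masking before the softmax can be illustrated on a list of score rows, each with its own valid length. This simplified sketch assumes masked positions are replaced by a very large negative value (-1e6 here) before a standard softmax; the name masked_softmax_sketch and the 2D list input are hypothetical, whereas the library function works on batched tensors.

```python
import math


def masked_softmax_sketch(X, valid_lens):
    """Softmax over each row of X, ignoring positions >= that row's valid length."""
    out = []
    for row, valid_len in zip(X, valid_lens):
        # Masked positions get a huge negative score, so exp() maps them to ~0.
        masked = [x if i < valid_len else -1e6 for i, x in enumerate(row)]
        peak = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


weights = masked_softmax_sketch([[1.0, 1.0, 5.0], [0.0, 0.0, 0.0]], [2, 3])
print(weights[0])  # the third position is masked out and receives weight 0.0
```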
d2l.torch.num_gpus()
Get the number of available GPUs. 
Defined in Section 6.7 


d2l.torch.plot(X, Y=None, xlabel=None, ylabel=None, legend=[], xlim=None,
ylim=None, xscale='linear', yscale='linear', fmts=('-', 'm--', 'g-.',
'r:'), figsize=(3.5, 2.5), axes=None)


Plot data points. 


Defined in Section 2.4 


d2l.torch.set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)


Set the axes for matplotlib. 
Defined in Section 2.4 


d2l.torch.set_figsize(figsize=(3.5, 2.5))
Set the figure size for matplotlib. 


Defined in Section 2.4 


d2l.torch.show_heatmaps(matrices, xlabel, ylabel, titles=None, figsize=(2.5, 2.5),
cmap='Reds')


Show heatmaps of matrices. 


Defined in Section 11.1 


d2l.torch.show_list_len_pair_hist(legend, xlabel, ylabel, xlist, ylist)


Plot the histogram for list length pairs. 
Defined in Section 10.5 


d2l.torch.try_all_gpus()
Return all available GPUs, or [cpu(),] if no GPU exists.


Defined in Section 6.7 
d2l.torch.try_gpu(i=0)
Return gpu(i) if it exists, otherwise return cpu().
Defined in Section 6.7 
d2l.torch.use_svg_display()


Use the svg format to display a plot in Jupyter. 


Defined in Section 2.4 